<a href="https://colab.research.google.com/github/shakinul-islam/Python/blob/main/NLP_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
# Step 1: Install necessary libraries
!pip install nltk transformers --quiet

# Step 2: Import required libraries
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from transformers import AutoTokenizer

# Step 3: Download all required NLTK resources
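nltk.download('punkt')      # Punkt tokenizer models, used by older NLTK releases
nltk.download('stopwords')  # Stopword lists, including English
nltk.download('punkt_tab')  # Punkt data in the format newer NLTK releases load for word_tokenize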

# Step 4: Take 5 custom text inputs
data = []
print("🔹 Enter 5 sentences to preprocess:")
for i in range(5):
    sentence = input(f"Sentence {i+1}: ")
    data.append(sentence)

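# Step 5: Define cleaning and preprocessing functions
# Build the stopword set once instead of on every call to preprocess().
STOP_WORDS = set(stopwords.words('english'))

def clean_text(text):
    """Strip HTML tags, then every character that is not an ASCII letter or whitespace."""
    text = re.sub(r'<.*?>', ' ', text)  # Replace HTML tags with a space so adjacent words stay separate
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove digits, punctuation, and non-ASCII characters
    return text

def preprocess(text):
    """Clean, lowercase, tokenize, and drop English stopwords."""
    text = clean_text(text)
    text = text.lower()
    tokens = word_tokenize(text)
    # Note: the NLTK English stopword list contains negations such as "not",
    # so "not bad" is reduced to ["bad"].
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return tokens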

# Step 6: Apply preprocessing
preprocessed_data = [preprocess(text) for text in data]

# Step 7: Tokenization using BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded_data = tokenizer(
    [" ".join(tokens) for tokens in preprocessed_data],
    padding="max_length",
    truncation=True,
    max_length=64,
    return_tensors="pt"
)

# Step 8: Show results
for i, (original, tokens) in enumerate(zip(data, preprocessed_data)):
    print(f"\n🔹 Sentence {i+1}")
    print("Original Text:", original)
    print("Cleaned Tokens:", tokens)
    print("Token IDs:", encoded_data['input_ids'][i])
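
A side note on the stopword step, kept separate from the cell above: NLTK's English stopword list contains negations such as "not", so a review like "not bad this movie" ends up with the same tokens as "bad movie". The sketch below shows one possible way to filter stopwords while keeping a chosen set of negation words. `SAMPLE_STOP_WORDS`, `NEGATIONS`, and `remove_stopwords` are hypothetical names introduced for illustration, and the small sample set only stands in for `stopwords.words('english')` so the example runs without any downloads.

```python
# Hypothetical stand-in for stopwords.words('english'). The real NLTK list is
# much longer, but it does contain every word listed here.
SAMPLE_STOP_WORDS = {"i", "it", "is", "this", "very", "not", "no", "nor"}

# Negation words we choose to keep (an assumption; adjust for your task).
NEGATIONS = {"not", "no", "nor"}


def remove_stopwords(tokens, stop_words, keep=frozenset()):
    """Drop tokens found in stop_words unless they also appear in keep."""
    effective = set(stop_words) - set(keep)
    return [tok for tok in tokens if tok not in effective]


tokens = "not bad this movie".split()

# Plain stopword removal loses the negation.
print(remove_stopwords(tokens, SAMPLE_STOP_WORDS))             # ['bad', 'movie']

# Keeping negations preserves the difference from "bad movie".
print(remove_stopwords(tokens, SAMPLE_STOP_WORDS, NEGATIONS))  # ['not', 'bad', 'movie']
```

With the real list, the same idea would be `remove_stopwords(tokens, stopwords.words('english'), NEGATIONS)`. Whether to keep negations depends on the downstream task.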

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


🔹 Enter 5 sentences to preprocess:
Sentence 1: nice movie
Sentence 2: bad movie
Sentence 3: not bad this movie
Sentence 4: i love this acting
Sentence 5: it is very good movie




🔹 Sentence 1
Original Text: nice movie
Cleaned Tokens: ['nice', 'movie']
Token IDs: tensor([ 101, 3835, 3185,  102,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0])
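
The zeros in the tensor above are padding: `padding="max_length"` with `max_length=64` extends every sequence to 64 ids, and `bert-base-uncased` uses id 0 for `[PAD]`, 101 for `[CLS]`, and 102 for `[SEP]`. The tokenizer also returns an `attention_mask` alongside `input_ids` that marks real tokens with 1 and padding with 0. The following minimal sketch rebuilds that mask by hand from the ids printed for Sentence 1, purely to illustrate the layout. `attention_mask_from_ids` and `strip_padding` are hypothetical helper names, not part of the `transformers` API.

```python
PAD_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary


def attention_mask_from_ids(input_ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding positions."""
    return [0 if tok == pad_id else 1 for tok in input_ids]


def strip_padding(input_ids, pad_id=PAD_ID):
    """Return only the ids that are not padding."""
    return [tok for tok in input_ids if tok != pad_id]


# Ids printed for Sentence 1: [CLS]=101, nice=3835, movie=3185, [SEP]=102,
# followed by 60 padding ids to reach max_length=64.
ids = [101, 3835, 3185, 102] + [0] * 60

mask = attention_mask_from_ids(ids)
print(len(ids), sum(mask))  # 64 4
print(strip_padding(ids))   # [101, 3835, 3185, 102]
```

Only 4 of the 64 positions carry information here. For short inputs like these, `padding="longest"` (pad to the longest sequence in the batch) would produce much smaller tensors.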

🔹 Sentence 2
Original Text: bad movie
Cleaned Tokens: ['bad', 'movie']
Token IDs: tensor([ 101, 2919, 3185,  102,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,