## NLP Assignment

### Context
In this assignment, you will apply various Natural Language Processing techniques to analyze and process a given text using Python. You will utilize libraries such as NLTK and scikit-learn to perform these tasks.

### Sample Text
"In the 21st-century tech-landscape: AI (#Artificial_Intelligence), VR (Virtual Reality), and IoT (Internet of Things) are buzzwords. Yet, amidst this progress, issues such as e-waste, cyber-security, & ethical AI spark debates. It's a conundrum of 'innovate or perish', where every '@'mention' and 'like' on social media platforms can have unforeseen consequences."


In [22]:
import nltk
import string
from nltk.tokenize import word_tokenize

In [23]:
my_text ="In the 21st-century tech-landscape: AI (#Artificial_Intelligence), VR (Virtual Reality), and IoT (Internet of Things) are buzzwords. Yet, amidst this progress, issues such as e-waste, cyber-security, & ethical AI spark debates. It's a conundrum of 'innovate or perish', where every '@'mention' and 'like' on social media platforms can have unforeseen consequences."

**Task 1: Text Cleaning**
- Remove all punctuation from the text.
- Convert all text to lowercase.
- Remove all stopwords (use NLTK's predefined list of stopwords).


In [24]:
print(my_text)

In the 21st-century tech-landscape: AI (#Artificial_Intelligence), VR (Virtual Reality), and IoT (Internet of Things) are buzzwords. Yet, amidst this progress, issues such as e-waste, cyber-security, & ethical AI spark debates. It's a conundrum of 'innovate or perish', where every '@'mention' and 'like' on social media platforms can have unforeseen consequences.


## Step 1: Converting to lowercase

In [25]:
my_text = my_text.lower()

In [26]:
print(my_text)

in the 21st-century tech-landscape: ai (#artificial_intelligence), vr (virtual reality), and iot (internet of things) are buzzwords. yet, amidst this progress, issues such as e-waste, cyber-security, & ethical ai spark debates. it's a conundrum of 'innovate or perish', where every '@'mention' and 'like' on social media platforms can have unforeseen consequences.


## Step 2: Removing the punctuation

In [29]:
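# Delete every character listed in string.punctuation.
# Hyphens and underscores are deleted too, so 'e-waste' becomes 'ewaste'.
my_text = my_text.translate(str.maketrans('', '', string.punctuation))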

In [30]:
my_text

'in the 21stcentury techlandscape ai artificialintelligence vr virtual reality and iot internet of things are buzzwords yet amidst this progress issues such as ewaste cybersecurity  ethical ai spark debates its a conundrum of innovate or perish where every mention and like on social media platforms can have unforeseen consequences'
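Deleting punctuation outright fuses hyphenated words, as in `21stcentury` and `techlandscape` above. The following is a minimal sketch of an alternative that replaces each punctuation character with a space instead. The function name `strip_punctuation` and its `keep_word_breaks` flag are illustrative assumptions, not part of the assignment.

```python
import string

def strip_punctuation(text, keep_word_breaks=True):
    """Remove punctuation, optionally leaving a space where each mark was."""
    if keep_word_breaks:
        # Map every punctuation character to a space
        table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    else:
        # Delete every punctuation character (the approach used in Step 2)
        table = str.maketrans("", "", string.punctuation)
    # split/join collapses the runs of spaces left behind
    return " ".join(text.translate(table).split())

sample = "e-waste, cyber-security, & ethical ai"
print(strip_punctuation(sample, keep_word_breaks=False))
print(strip_punctuation(sample, keep_word_breaks=True))
```

Which variant is better depends on whether `e-waste` should count as one token or two in the later tasks.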

In [31]:
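from nltk.corpus import stopwords as nltk_stopwords

In [32]:
# The corpus reader is imported under an alias so the list below does not shadow it
stopwords = nltk_stopwords.words('english')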


In [33]:
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [34]:
my_text_tokenized = word_tokenize(my_text)

In [35]:
print(my_text_tokenized)

['in', 'the', '21stcentury', 'techlandscape', 'ai', 'artificialintelligence', 'vr', 'virtual', 'reality', 'and', 'iot', 'internet', 'of', 'things', 'are', 'buzzwords', 'yet', 'amidst', 'this', 'progress', 'issues', 'such', 'as', 'ewaste', 'cybersecurity', 'ethical', 'ai', 'spark', 'debates', 'its', 'a', 'conundrum', 'of', 'innovate', 'or', 'perish', 'where', 'every', 'mention', 'and', 'like', 'on', 'social', 'media', 'platforms', 'can', 'have', 'unforeseen', 'consequences']


## Step 3: Removing stop words

In [36]:
# Keep only the tokens that are not in the stop word list
my_text_filtered_stop_words = [token for token in my_text_tokenized if token not in stopwords]
print(my_text_filtered_stop_words)

['21stcentury', 'techlandscape', 'ai', 'artificialintelligence', 'vr', 'virtual', 'reality', 'iot', 'internet', 'things', 'buzzwords', 'yet', 'amidst', 'progress', 'issues', 'ewaste', 'cybersecurity', 'ethical', 'ai', 'spark', 'debates', 'conundrum', 'innovate', 'perish', 'every', 'mention', 'like', 'social', 'media', 'platforms', 'unforeseen', 'consequences']



In [47]:
filtered_words_string = ' '.join(my_text_filtered_stop_words)

In [48]:
print(filtered_words_string)

21stcentury techlandscape ai artificialintelligence vr virtual reality iot internet things buzzwords yet amidst progress issues ewaste cybersecurity ethical ai spark debates conundrum innovate perish every mention like social media platforms unforeseen consequences
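The three cleaning steps can also be wrapped in one reusable function. This is a sketch only: `clean_text` is a hypothetical helper, `DEMO_STOPWORDS` is a small hand-picked set standing in for NLTK's list, and `str.split` stands in for `word_tokenize`.

```python
import string

# Assumption: a tiny stand-in for NLTK's English stop word list
DEMO_STOPWORDS = {"in", "the", "and", "of", "are", "a", "or", "on", "this", "such", "as"}

def clean_text(text, stop_words=DEMO_STOPWORDS):
    """Lowercase the text, delete punctuation, then drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    # Whitespace splitting is enough once punctuation is gone
    return [token for token in no_punct.split() if token not in stop_words]

cleaned = clean_text("In the 21st-century tech-landscape: AI and VR are buzzwords.")
print(cleaned)
```

Passing the stop words as a set keeps each membership test constant-time, which matters on longer texts.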


**Task 2: Custom Tokenization**
- Use NLTK's `RegexpTokenizer` to create a custom tokenizer that only captures words of three or more letters.
- Compare the results of your custom tokenization with the default tokenization behavior of `CountVectorizer` or `TfidfVectorizer` from `scikit-learn`.


In [38]:
from nltk.tokenize import RegexpTokenizer

In [51]:
# Define a pattern that captures words of three or more characters
# (\w also matches digits and underscores, not only letters)
pattern = r'\b\w{3,}\b'

custom_tokenizer = RegexpTokenizer(pattern)
# Tokenize the preprocessed text using the custom tokenizer
tokens_custom = custom_tokenizer.tokenize(filtered_words_string)

print('Tokens after custom tokenization:\n', tokens_custom)


Tokens after custom tokenization:
 ['21stcentury', 'techlandscape', 'artificialintelligence', 'virtual', 'reality', 'iot', 'internet', 'things', 'buzzwords', 'yet', 'amidst', 'progress', 'issues', 'ewaste', 'cybersecurity', 'ethical', 'spark', 'debates', 'conundrum', 'innovate', 'perish', 'every', 'mention', 'like', 'social', 'media', 'platforms', 'unforeseen', 'consequences']


In [54]:
from sklearn.feature_extraction.text import CountVectorizer


In [57]:
vectorizer = CountVectorizer()

In [60]:
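# Fit and transform the preprocessed text.
# fit_transform expects an iterable of documents rather than a single string,
# so each token in the list is treated as its own one-word document.
X = vectorizer.fit_transform(my_text_filtered_stop_words)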

# Get the feature names (tokens)
feature_names = vectorizer.get_feature_names_out()

# Print the feature names
print(feature_names)

['21stcentury' 'ai' 'amidst' 'artificialintelligence' 'buzzwords'
 'consequences' 'conundrum' 'cybersecurity' 'debates' 'ethical' 'every'
 'ewaste' 'innovate' 'internet' 'iot' 'issues' 'like' 'media' 'mention'
 'perish' 'platforms' 'progress' 'reality' 'social' 'spark'
 'techlandscape' 'things' 'unforeseen' 'virtual' 'vr' 'yet']


In [68]:
# Compare the custom tokenizer with CountVectorizer.
# Note: tokens_custom is the full token list, while feature_names is
# CountVectorizer's vocabulary (unique tokens, sorted alphabetically).
print('Number of tokens generated by the custom tokenizer:', len(tokens_custom))
print('Number of tokens generated by CountVectorizer:', len(feature_names))


print('Tokens generated by CountVectorizer but not by the custom tokenizer:')
for token in feature_names:
    if token not in tokens_custom:
        print(token)

# for token in tokens_custom:
#     if token not in feature_names:
#         print(token)

Number of tokens generated by the custom tokenizer: 29
Number of tokens generated by CountVectorizer: 31
Tokens generated by CountVectorizer but not by the custom tokenizer:
ai
vr
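The two extra tokens follow from the minimum token length. `CountVectorizer`'s default `token_pattern` is `(?u)\b\w\w+\b`, which keeps tokens of two or more characters, while the custom pattern requires three. The sketch below reproduces the difference with the standard `re` module alone; the short `text` string is an illustrative sample, not the full cleaned text.

```python
import re

text = "ai vr iot virtual reality ai"

custom_pattern = r"\b\w{3,}\b"              # three or more word characters
default_sklearn_pattern = r"(?u)\b\w\w+\b"  # CountVectorizer's default: two or more

custom_tokens = re.findall(custom_pattern, text)
default_tokens = re.findall(default_sklearn_pattern, text)

# Tokens kept by the default pattern but dropped by the custom one
only_in_default = sorted(set(default_tokens) - set(custom_tokens))
print(custom_tokens)
print(default_tokens)
print(only_in_default)
```

Passing `token_pattern=r"\b\w{3,}\b"` to `CountVectorizer` would make the two tokenizers agree on token length.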


**Task 3: Bag of Words Model**
- Construct a Bag of Words model using your cleaned and tokenized text from Task 1 and Task 2, and display the word frequencies.


In [76]:
# Construct the bag of words and display the word frequencies.
# Each token is passed as its own one-word document, so X has one row per token.
X = vectorizer.fit_transform(tokens_custom)

print(X)

print('\n')


print(X.toarray())

# Get the feature names (words)
words = vectorizer.get_feature_names_out()

# Get the word frequencies
word_frequencies = X.toarray().sum(axis=0)

# Display word frequencies
for word, freq in zip(words, word_frequencies):
    print(f"{word}: {freq}")

  (0, 0)	1
  (1, 24)	1
  (2, 2)	1
  (3, 27)	1
  (4, 21)	1
  (5, 13)	1
  (6, 12)	1
  (7, 25)	1
  (8, 3)	1
  (9, 28)	1
  (10, 1)	1
  (11, 20)	1
  (12, 14)	1
  (13, 10)	1
  (14, 6)	1
  (15, 8)	1
  (16, 23)	1
  (17, 7)	1
  (18, 5)	1
  (19, 11)	1
  (20, 18)	1
  (21, 9)	1
  (22, 17)	1
  (23, 15)	1
  (24, 22)	1
  (25, 16)	1
  (26, 19)	1
  (27, 26)	1
  (28, 4)	1


[[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
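Because each token was passed as a separate document, every row of the matrix above contains a single 1. As a cross-check, word frequencies for one document can be counted directly with `collections.Counter`. This is a minimal sketch on a short hand-written token list, not the full output of Task 2.

```python
from collections import Counter

# Assumption: a short illustrative token list with one repeated word
tokens = ["ai", "vr", "ethical", "ai", "spark", "debates"]

# Counter maps each distinct token to the number of times it occurs
word_frequencies = Counter(tokens)

for word, freq in sorted(word_frequencies.items()):
    print(f"{word}: {freq}")
```

Joining the tokens into one string and passing `[joined_string]` to `fit_transform` would give the same counts as a single matrix row.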

**Task 4: Analyze Word Importance with TF-IDF**
- Apply `TfidfVectorizer` to the sample text to compute the TF-IDF scores for each word in the document.
- Comment on the differences between TF-IDF scores and word frequencies from the Bag of Words model.


In [77]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [79]:
# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

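# Fit and transform the CountVectorizer vocabulary from Task 2.
# Each entry is a one-word document, so every non-zero TF-IDF score is 1.0 after L2 normalization.
tfidf_matrix = tfidf_vectorizer.fit_transform(feature_names)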

# Get the feature names (words)
words = tfidf_vectorizer.get_feature_names_out()

# Get the TF-IDF scores
tfidf_scores = tfidf_matrix.toarray()

# Display TF-IDF scores for each word
for i, word in enumerate(words):
    print('\n')
    print(f"{word}: {tfidf_scores[:, i]}")



21stcentury: [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]


ai: [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]


amidst: [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]


artificialintelligence: [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]


buzzwords: [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]


consequences: [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]


conundrum: [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]


cybersecurity: [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]


debates: [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0.]


ethical
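Every score above is 1.0 because each document holds exactly one word, so TF-IDF cannot separate common words from rare ones. The sketch below computes TF-IDF by hand for multi-word documents, following scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, with L2 normalization per document. The function name `tfidf` and the two sample documents are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: score} dict per document (smoothed IDF, L2-normalized)."""
    n_docs = len(docs)
    tokenized = [doc.split() for doc in docs]
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Guard against empty documents, which have a zero norm
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows

rows = tfidf(["ai ethical ai", "ai spark"])
for row in rows:
    print({term: round(score, 3) for term, score in row.items()})
```

In the second document `spark` outscores `ai` even though both occur once, because `ai` appears in every document. A raw Bag of Words count would rate them equally.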

**Task 5: Part-of-Speech Tagging**
- Perform POS tagging on the cleaned and tokenized text.
- Identify the most common part of speech in the sample text.


In [84]:
from textblob import TextBlob
tb = TextBlob(filtered_words_string)
print(tb)

21stcentury techlandscape ai artificialintelligence vr virtual reality iot internet things buzzwords yet amidst progress issues ewaste cybersecurity ethical ai spark debates conundrum innovate perish every mention like social media platforms unforeseen consequences


In [88]:
from collections import Counter

# Perform POS tagging
tagged_words = tb.tags

# Extract POS tags
pos_tags = [tag for word, tag in tagged_words]

# Count the occurrences of each POS tag
pos_tag_counts = Counter(pos_tags)

# Find the most common POS tag
most_common_pos_tag = pos_tag_counts.most_common(1)[0]

# Display the most common POS tag
print("Most common part of speech:", most_common_pos_tag)

Most common part of speech: ('NN', 10)


The tag 'NN' is the Penn Treebank tag for a singular or mass noun, which is the most common part of speech in our text.
The number '10' represents the frequency of occurrences of this part of speech in our text.
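Penn Treebank tags split nouns into several fine-grained tags (`NN`, `NNS`, `NNP`, `NNPS`), so counting raw tags can understate how many nouns there are overall. The sketch below groups tags into coarse categories before counting. The `tagged` list is a hand-written illustration rather than the actual tagger output, and `coarse_tag` is a hypothetical helper.

```python
from collections import Counter

# Assumption: illustrative (word, tag) pairs, not the real output of tb.tags
tagged = [("reality", "NN"), ("things", "NNS"), ("issues", "NNS"),
          ("virtual", "JJ"), ("spark", "VBP"), ("internet", "NN")]

def coarse_tag(tag):
    """Collapse a Penn Treebank tag into a coarse part-of-speech label."""
    if tag.startswith("NN"):
        return "noun"
    if tag.startswith("VB"):
        return "verb"
    if tag.startswith("JJ"):
        return "adjective"
    if tag.startswith("RB"):
        return "adverb"
    return "other"

fine_counts = Counter(tag for _, tag in tagged)
coarse_counts = Counter(coarse_tag(tag) for _, tag in tagged)

print(fine_counts.most_common())
print(coarse_counts.most_common(1))
```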

**Task 6: Working with Synonyms and Antonyms**
- Choose three words from the text and use WordNet to find their synonyms and antonyms.
- Discuss the role of synonyms and antonyms in text analysis and how they could affect the interpretation of the text.


In [93]:
from nltk.corpus import wordnet
chosen_words = ['virtual', 'reality', 'progress']

In [99]:
# Function to find synonyms and antonyms
def find_synonyms_antonyms(word):
    """Return the sets of WordNet synonyms and antonyms for a word."""
    synonyms = set()
    antonyms = set()

    # Iterate over each synset of the word in WordNet
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            # Every lemma name in the synset counts as a synonym
            synonyms.add(lemma.name())

            # Keep the first antonym of the lemma, if it has one
            if lemma.antonyms():
                antonyms.add(lemma.antonyms()[0].name())

    return synonyms, antonyms

# Look up and print the synonyms and antonyms of each chosen word
for word in chosen_words:
    synonyms, antonyms = find_synonyms_antonyms(word)
    print(f"Word: {word}")
    print(f"Synonyms: {synonyms}")
    print(f"Antonyms: {antonyms}")

Word: virtual
Synonyms: {'virtual', 'practical'}
Antonyms: set()
Word: reality
Synonyms: {'realism', 'world', 'reality', 'realness'}
Antonyms: {'unreality'}
Word: progress
Synonyms: {'pass_on', 'go_on', 'get_along', 'build', 'shape_up', 'advance', 'forward_motion', 'get_on', 'onward_motion', 'work_up', 'come_on', 'come_along', 'progression', 'march_on', 'move_on', 'advancement', 'progress', 'procession', 'build_up'}
Antonyms: {'recede', 'retreat', 'regress'}
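One way synonyms affect text analysis is through counting: words that share a meaning are tallied separately unless they are mapped to a common form first. The sketch below collapses a few of the synonyms listed above before counting. `SYNONYM_TO_CANONICAL` is a hand-picked, hypothetical mapping; whether two words are truly interchangeable depends on the sense in which they are used.

```python
from collections import Counter

# Assumption: a small hand-picked mapping built from the WordNet output above
SYNONYM_TO_CANONICAL = {
    "advancement": "progress",
    "progression": "progress",
    "realness": "reality",
}

def normalize_synonyms(tokens, mapping=SYNONYM_TO_CANONICAL):
    """Replace each token by its canonical synonym, if one is defined."""
    return [mapping.get(token, token) for token in tokens]

tokens = ["progress", "advancement", "reality", "realness", "progression"]
counts_before = Counter(tokens)
counts_after = Counter(normalize_synonyms(tokens))

print(counts_before)
print(counts_after)
```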


**Task 7: Text Normalization**
- Apply stemming and lemmatization to the cleaned and tokenized text.
- Compare the results and discuss the circumstances where each method would be preferred over the other.


In [102]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = word_tokenize(filtered_words_string)

porter_stemmer = PorterStemmer()

# Initialize WordNetLemmatizer for lemmatization
wordnet_lemmatizer = WordNetLemmatizer()

# Perform stemming and lemmatization
stemmed_words = [porter_stemmer.stem(word) for word in tokens]
lemmatized_words = [wordnet_lemmatizer.lemmatize(word) for word in tokens]

# Print stemmed and lemmatized words
print("Stemmed words:", stemmed_words)
print('\n')
print("Lemmatized words:", lemmatized_words)

Stemmed words: ['21stcenturi', 'techlandscap', 'ai', 'artificialintellig', 'vr', 'virtual', 'realiti', 'iot', 'internet', 'thing', 'buzzword', 'yet', 'amidst', 'progress', 'issu', 'ewast', 'cybersecur', 'ethic', 'ai', 'spark', 'debat', 'conundrum', 'innov', 'perish', 'everi', 'mention', 'like', 'social', 'media', 'platform', 'unforeseen', 'consequ']


Lemmatized words: ['21stcentury', 'techlandscape', 'ai', 'artificialintelligence', 'vr', 'virtual', 'reality', 'iot', 'internet', 'thing', 'buzzword', 'yet', 'amidst', 'progress', 'issue', 'ewaste', 'cybersecurity', 'ethical', 'ai', 'spark', 'debate', 'conundrum', 'innovate', 'perish', 'every', 'mention', 'like', 'social', 'medium', 'platform', 'unforeseen', 'consequence']
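Pairing a few entries from the two outputs above shows where the methods disagree. The three lists below are copied from those outputs. The comparison uses plain Python, so it is only a sketch for reading the results and does not call NLTK.

```python
# Values copied from the stemmed and lemmatized outputs above
tokens = ["reality", "things", "issues", "media", "debates"]
stems = ["realiti", "thing", "issu", "media", "debat"]
lemmas = ["reality", "thing", "issue", "medium", "debate"]

# Keep only the tokens for which the stem and the lemma differ
disagreements = [(t, s, l) for t, s, l in zip(tokens, stems, lemmas) if s != l]

for token, stem, lemma in disagreements:
    print(f"{token:10} stem={stem:10} lemma={lemma}")
```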


Stemming and lemmatization both reduce words to a base form, but they differ in approach and outcome. The Porter stemmer strips suffixes with fixed rules, so its output is often not a dictionary word: above, 'reality' becomes 'realiti', 'issues' becomes 'issu', and 'consequences' becomes 'consequ'. The WordNet lemmatizer looks each word up in a dictionary and returns its lemma, so 'issues' becomes 'issue' and 'consequences' becomes 'consequence'. Lemmatization also depends on part of speech: `lemmatize` treats every word as a noun unless a `pos` argument is passed, which is why 'media' becomes 'medium' here. Stemming is preferred when computational efficiency and simplicity matter more than linguistic precision. Lemmatization is preferred when the output must stay readable and keep its meaning, at the cost of a dictionary lookup and, ideally, POS tags.

**Task 8: Spell Correction**
- Implement a spell correction algorithm or use a library to correct the misspelled words in the sample text.
- Discuss how spell correction can impact the sentiment or meaning of the text.


In [105]:
from autocorrect import Speller

speller = Speller()

# Correct misspelled words
corrected_text = speller(filtered_words_string)

# Print the corrected text
print(filtered_words_string)
print('\n')
print("Corrected text:", corrected_text)

21stcentury techlandscape ai artificialintelligence vr virtual reality iot internet things buzzwords yet amidst progress issues ewaste cybersecurity ethical ai spark debates conundrum innovate perish every mention like social media platforms unforeseen consequences


Corrected text: 21century techlandscape ai artificialintelligence vr virtual reality iot internet things buzzwords yet amidst progress issues waste cybersecurity ethical ai spark debates conundrum innovate perish every mention like social media platforms unforeseen consequences
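In the output above, the speller changed `21stcentury` to `21century` and `ewaste` to `waste`, so correction can replace a valid domain term with a different word. The sketch below shows the same effect with a simple closest-match corrector built on the standard library's `difflib`. This is not how `autocorrect` works internally, and `VOCABULARY` is a tiny hypothetical word list.

```python
import difflib

# Assumption: a tiny illustrative vocabulary, far smaller than a real dictionary
VOCABULARY = ["waste", "internet", "reality", "security"]

def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word itself if none is close enough."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

for word in ["ewaste", "internet", "iot"]:
    print(word, "->", correct_word(word))
```

A corrector like this only knows its vocabulary, so adding domain terms such as `ewaste` to the word list is one way to stop it from rewriting them.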


**Task 9: Similarity Measurement**
- Using WordNet synsets, calculate the semantic similarity between the different uses of the word "bank" in the context of a financial institution and the side of a river.
- Explain how semantic similarity can be useful in disambiguating word meanings in NLP.


In [106]:
# wordnet was already imported in Task 6:
# from nltk.corpus import wordnet

# Look up the noun senses of "bank". In WordNet, the first noun synset is the
# river-side sense and the second is the financial-institution sense.
bank_synsets = wordnet.synsets('bank', pos='n')
river_bank = bank_synsets[0]      # bank.n.01: sloping land beside a body of water
financial_bank = bank_synsets[1]  # depository_financial_institution.n.01

# Calculate the Wu-Palmer similarity between the two senses
similarity_score = river_bank.wup_similarity(financial_bank)

# Print the semantic similarity score
print("Semantic similarity score:", similarity_score)


Semantic similarity score: 0.14285714285714285
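Wu-Palmer similarity is defined as 2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest ancestor the two senses share. A low score such as the one above means the senses only meet near the root of the hierarchy. The sketch below applies the formula to a toy taxonomy. The `parents` dictionary is a made-up hierarchy for illustration, not WordNet's actual structure, and the root is given depth 1.

```python
# Assumption: a toy hypernym hierarchy (child -> parent), not real WordNet data
parents = {
    "entity": None,
    "object": "entity",
    "land": "object",
    "slope": "land",
    "organization": "entity",
    "institution": "organization",
    "financial_institution": "institution",
}

def path_to_root(node):
    """Return the chain of nodes from `node` up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path

def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node))

def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # The first shared node on a's path to the root is the deepest common ancestor
    lcs = next(node for node in path_to_root(a) if node in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wu_palmer("slope", "financial_institution"))
print(wu_palmer("slope", "land"))
```

For disambiguation, a word sense can be scored against the senses of its neighbouring words, and the sense with the highest similarity is the more likely reading.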


**Task 10: Fixing Word Lengthening**
- Write a function to identify words with characters repeated more than twice in a row and shorten the repetition to two characters.
- Discuss the impact of correcting lengthened words on text analysis and sentiment analysis.


In [107]:
import re

def shorten_repeated_chars(word):
    # Collapse any run of three or more identical characters down to two
    shortened_word = re.sub(r'(.)\1{2,}', r'\1\1', word)
    return shortened_word

# Test the function with the words "woooow" and "goooal"
word1 = "woooow"
word2 = "goooal"

# Shorten the repeated characters
shortened_word1 = shorten_repeated_chars(word1)
shortened_word2 = shorten_repeated_chars(word2)

# Print the shortened words
print("Original word:", word1)
print("Shortened word:", shortened_word1)

print("\nOriginal word:", word2)
print("Shortened word:", shortened_word2)


Original word: woooow
Shortened word: woow

Original word: goooal
Shortened word: gooal
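The pattern `(.)` matches any character, so the function above also shortens repeated digits and punctuation: `1000` becomes `100`. The sketch below compares it with a letters-only variant. The function names `shorten_any` and `shorten_repeated_letters` are illustrative assumptions.

```python
import re

def shorten_any(text):
    """Same pattern as above: collapse runs of any repeated character."""
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

def shorten_repeated_letters(text):
    """Collapse runs of three or more identical ASCII letters to two."""
    return re.sub(r"([A-Za-z])\1{2,}", r"\1\1", text)

sentence = "soooo coooool, only 1000 left!!!"
print(shorten_any(sentence))
print(shorten_repeated_letters(sentence))
```

Restricting the match to letters leaves numbers intact and keeps emphasis such as `!!!`, which can itself be a useful signal for sentiment analysis.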
