<a href="https://colab.research.google.com/github/tarakantaacharya/NLPinternal/blob/main/NLP_lab_internal_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Here are the steps to install **NLTK** and **spaCy**, explore their main features, generate a word cloud, and download corpora:

### Step 1: Install NLTK and spaCy

1. **Install NLTK**:
   Run this command in your terminal or command prompt:
   ```bash
   pip install nltk
   ```

2. **Install spaCy**:
   Run this command in your terminal or command prompt:
   ```bash
   pip install spacy
   ```

3. **Install a spaCy Language Model** (for example, the English model):
   ```bash
   python -m spacy download en_core_web_sm
   ```
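
Before moving on, it can help to confirm that the packages are importable. The following is a minimal sketch that uses only the standard library; `check_packages` is a hypothetical helper name, not part of NLTK or spaCy.

```python
import importlib.util

def check_packages(names):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Report the status of the packages installed in Step 1
for name, installed in check_packages(["nltk", "spacy", "en_core_web_sm"]).items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```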

### Step 2: Exploring NLTK

1. **Import NLTK and Download Corpora**:
   After installation, you can import NLTK and download necessary corpora.
   
   ```python
   import nltk
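   nltk.download('punkt')  # Tokenizer models (older NLTK releases)
   nltk.download('punkt_tab')  # Tokenizer tables required by newer NLTK releases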
   nltk.download('stopwords')  # For stop words
   nltk.download('wordnet')  # For lemmatization
   nltk.download('movie_reviews')  # Example of a corpus
   ```

2. **Explore NLTK Features**:
   - **Tokenization**:
     ```python
     from nltk.tokenize import word_tokenize
     text = "NLTK is a leading platform for building Python programs to work with human language data."
     tokens = word_tokenize(text)
     print(tokens)
     ```

   - **Stopwords**:
     ```python
     from nltk.corpus import stopwords
     stop_words = set(stopwords.words('english'))
     print(stop_words)
     ```

   - **Lemmatization**:
     ```python
     from nltk.stem import WordNetLemmatizer
     lemmatizer = WordNetLemmatizer()
     print(lemmatizer.lemmatize("running", pos='v'))  # Verb lemmatization
     ```

   - **Word Frequency Distribution**:
     ```python
     from nltk.probability import FreqDist
     fdist = FreqDist(tokens)
     fdist.plot()  # To plot word frequency distribution
     ```
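
The features above are often chained: tokenize, drop stop words, then count what remains. The sketch below imitates that pipeline with the standard library only, so it runs without any downloads. The regex tokenizer and the tiny `STOP_WORDS` set are simplifying assumptions for illustration, not NLTK's tokenizer or stop-word list, and `top_words` is a hypothetical name.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set (assumption); NLTK's English list is much longer.
STOP_WORDS = {"is", "a", "for", "to", "with", "the", "and", "of"}

def top_words(text, k=3):
    """Lowercase, keep alphabetic runs, drop stop words, and count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(k)

text = "NLTK is a leading platform for building Python programs to work with human language data."
print(top_words(text))
```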

### Step 3: Exploring spaCy

1. **Import and Load spaCy Language Model**:
   ```python
   import spacy
   nlp = spacy.load('en_core_web_sm')
   ```

2. **Explore spaCy Features**:
   - **Tokenization**:
     ```python
     doc = nlp("spaCy is a powerful NLP library.")
     for token in doc:
         print(token.text)
     ```

   - **Part-of-Speech Tagging**:
     ```python
     for token in doc:
         print(token.text, token.pos_)
     ```

   - **Named Entity Recognition**:
     ```python
     for ent in doc.ents:
         print(ent.text, ent.label_)
     ```

   - **Dependency Parsing**:
     ```python
     for token in doc:
         print(token.text, token.dep_, token.head.text)
     ```

### Step 4: Installing and Generating Word Cloud

1. **Install WordCloud**:
   ```bash
   pip install wordcloud
   ```

2. **Generate Word Cloud**:
   ```python
   from wordcloud import WordCloud
   import matplotlib.pyplot as plt

   # Sample text for the word cloud
   text = "Python is a great language for machine learning and natural language processing."

   # Create the word cloud
   wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

   # Display the word cloud
   plt.imshow(wordcloud, interpolation='bilinear')
   plt.axis('off')
   plt.show()
   ```

### Step 5: Downloading Corpora in NLTK

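1. **List Available Corpus Readers**:
   `nltk.corpus.reader.__all__` lists the corpus reader classes and helper functions that NLTK provides, not the corpora installed on your machine. To browse downloadable corpora, run `nltk.download()` with no arguments to open the interactive downloader.
   ```python
   import nltk
   print(nltk.corpus.reader.__all__)
   ```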

2. **Download Additional Corpora**:
   You can download various corpora like `gutenberg`, `reuters`, `abc`, etc., by running:
   ```python
   nltk.download('gutenberg')
   nltk.download('reuters')
   ```
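
Calling `nltk.download` repeatedly re-checks the package on every run. A common pattern is to look the resource up first and download only when it is missing. The sketch below assumes that `nltk.data.find` raises `LookupError` for a missing resource; `ensure_resource` is a hypothetical helper, and its `find` and `download` parameters exist only so the logic can be exercised with stand-ins when NLTK is not installed.

```python
def ensure_resource(path, package, find=None, download=None):
    """Download `package` only if `path` cannot be found. Return True if a download ran."""
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tested with stand-ins
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(path)
        return False
    except LookupError:
        download(package)
        return True

# Example (requires NLTK): ensure_resource("corpora/gutenberg", "gutenberg")
```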

By following these steps, you should be able to explore the basic functionalities of NLTK and spaCy, work with corpora, and generate a word cloud.

# Week 2




In [None]:
import nltk
nltk.download('punkt_tab')  # Required for tokenizers
from nltk.tokenize import word_tokenize, sent_tokenize

# Sample text with multiple paragraphs
# text=input("enter text:\n")
text = '''Natural Language Processing (NLP) helps computers understand human language.

Tokenization breaks text into words and sentences for analysis.'''

# Word Tokenizer
def word_tokenizer(text):
    words = word_tokenize(text)
    return words

# Sentence Tokenizer
def sentence_tokenizer(text):
    sentences = sent_tokenize(text)
    return sentences

# Paragraph Tokenizer
def paragraph_tokenizer(text):
    paragraphs = text.split("\n\n")  # Splitting based on double newline characters
    return paragraphs

# Using the tokenizers
# print("Original Text:")
# print(text, "\n")

# Paragraph Tokenization
paragraphs = paragraph_tokenizer(text)
print("Paragraphs:\n",paragraphs)
for i, para in enumerate(paragraphs, 1):
    print(f"Paragraph {i}: {para}\n")

# Sentence Tokenization
sentences = sentence_tokenizer(text)
print("Sentences:\n",sentences)
for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}")

# Word Tokenization
words = word_tokenizer(text)
print("\nWords:\n",words)
# print(words)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Paragraphs:
 ['Natural Language Processing (NLP) helps computers understand human language.', 'Tokenization breaks text into words and sentences for analysis.']
Paragraph 1: Natural Language Processing (NLP) helps computers understand human language.

Paragraph 2: Tokenization breaks text into words and sentences for analysis.

Sentences:
 ['Natural Language Processing (NLP) helps computers understand human language.', 'Tokenization breaks text into words and sentences for analysis.']
Sentence 1: Natural Language Processing (NLP) helps computers understand human language.
Sentence 2: Tokenization breaks text into words and sentences for analysis.

Words:
 ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'helps', 'computers', 'understand', 'human', 'language', '.', 'Tokenization', 'breaks', 'text', 'into', 'words', 'and', 'sentences', 'for', 'analysis', '.']
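
`text.split("\n\n")` leaves stray newlines or empty strings behind when a paragraph break is longer than two newlines or contains spaces. A sketch of a more tolerant splitter, assuming a paragraph break is any run of blank or whitespace-only lines (`split_paragraphs` is a hypothetical name):

```python
import re

def split_paragraphs(text):
    """Split on one or more blank (or whitespace-only) lines and drop empty pieces."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in parts if p.strip()]

sample = "First paragraph.\n\n\nSecond paragraph.\n \nThird paragraph."
print(split_paragraphs(sample))
```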


In [None]:
import nltk
from nltk.corpus import brown

# Download necessary NLTK resources
nltk.download('brown')  # Download Brown corpus if not already downloaded

# Select a corpus (example: Brown corpus)
corpus_words = brown.words()

# Calculate total words and unique words
total_words = len(corpus_words)  # Total number of words
distinct_words = len(set(corpus_words))  # Total unique words (using set)

# Print results
print(f"Total words in the corpus: {total_words}")
print(f"Number of distinct words in the corpus: {distinct_words}")

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


Total words in the corpus: 1161192
Number of distinct words in the corpus: 56057
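
The counts above are case-sensitive, so `The` and `the` are counted as different words. The sketch below shows how the distinct-word count changes once case is folded, and also reports the type-token ratio (distinct words divided by total words). It uses a tiny hand-written token list as a stand-in for `brown.words()`; `vocabulary_stats` is a hypothetical helper.

```python
def vocabulary_stats(words):
    """Return total tokens, distinct tokens (case-sensitive and case-folded), and type-token ratio."""
    total = len(words)
    distinct = len(set(words))
    distinct_folded = len({w.lower() for w in words})
    ratio = distinct / total if total else 0.0
    return {"total": total, "distinct": distinct,
            "distinct_folded": distinct_folded, "type_token_ratio": ratio}

tokens = ["The", "cat", "saw", "the", "other", "cat", "."]
print(vocabulary_stats(tokens))
```

With the figures printed above, the case-sensitive ratio for the Brown corpus is 56057 / 1161192, roughly 0.048.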


# Week 3

In [None]:
import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from collections import Counter

nltk.download('punkt_tab')

def text_preprocessing(text):
    # Convert to lowercase
    text = text.lower()
    # Tokenize into words
    tokens = nltk.word_tokenize(text)
    return tokens

def generate_ngrams(text, n):
    tokens = text_preprocessing(text)
    # Generate n-grams
    n_grams = list(ngrams(tokens, n))
    # Count frequencies
    ngram_freq = Counter(n_grams)
    return n_grams, ngram_freq

# Example text
text = "Natural language processing is an exciting field of study."

# Generate unigrams
print("=== UNIGRAMS ===")
unigrams, unigram_freq = generate_ngrams(text, 1)
print("\nFirst 10 unigrams:", unigrams[:10])
print("\nTop 10 most common unigrams:")
for gram, count in unigram_freq.most_common(10):
    print(f"{gram}: {count}")

# Generate bigrams
print("\n=== BIGRAMS ===")
bigrams, bigram_freq = generate_ngrams(text, 2)
print("\nFirst 10 bigrams:", bigrams[:10])
print("\nTop 10 most common bigrams:")
for gram, count in bigram_freq.most_common(10):
    print(f"{gram}: {count}")

# Generate trigrams
print("\n=== TRIGRAMS ===")
trigrams, trigram_freq = generate_ngrams(text, 3)
print("\nFirst 10 trigrams:", trigrams[:10])
print("\nTop 10 most common trigrams:")
for gram, count in trigram_freq.most_common(10):
    print(f"{gram}: {count}")

=== UNIGRAMS ===

First 10 unigrams: [('natural',), ('language',), ('processing',), ('is',), ('an',), ('exciting',), ('field',), ('of',), ('study',), ('.',)]

Top 10 most common unigrams:
('natural',): 1
('language',): 1
('processing',): 1
('is',): 1
('an',): 1
('exciting',): 1
('field',): 1
('of',): 1
('study',): 1
('.',): 1

=== BIGRAMS ===

First 10 bigrams: [('natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'an'), ('an', 'exciting'), ('exciting', 'field'), ('field', 'of'), ('of', 'study'), ('study', '.')]

Top 10 most common bigrams:
('natural', 'language'): 1
('language', 'processing'): 1
('processing', 'is'): 1
('is', 'an'): 1
('an', 'exciting'): 1
('exciting', 'field'): 1
('field', 'of'): 1
('of', 'study'): 1
('study', '.'): 1

=== TRIGRAMS ===

First 10 trigrams: [('natural', 'language', 'processing'), ('language', 'processing', 'is'), ('processing', 'is', 'an'), ('is', 'an', 'exciting'), ('an', 'exciting', 'field'), ('exciting', 'field', 'of'), 

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
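
`nltk.util.ngrams` slides a window of length `n` over the token list. The same idea can be sketched in plain Python by zipping the list against shifted copies of itself. `simple_ngrams` is a hypothetical name used for illustration, and it omits the padding options that the NLTK function offers.

```python
from collections import Counter

def simple_ngrams(tokens, n):
    """Return all contiguous n-token tuples, in order."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "natural language processing is an exciting field of study .".split()
trigrams = simple_ngrams(tokens, 3)
print(len(trigrams))  # a list of k tokens yields k - n + 1 n-grams
print(Counter(trigrams).most_common(2))
```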


In [16]:
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter

nltk.download('punkt_tab')

text = """Natural language processing is a field of artificial intelligence.
It deals with the interaction between computers and humans using natural language.
Processing includes tasks such as tokenization, parsing, and sentiment analysis.
Understanding language is crucial for applications like chatbots, translation, and information retrieval.
Language processing helps in deriving meaning from text and is an essential part of modern AI systems."""

def most_probable_next_word(text, w1):
    words = word_tokenize(text)
    bigrams = Counter(nltk.bigrams(words))
    following_words = {w2: count for (prev, w2), count in bigrams.items() if prev == w1}
    # Pick the most frequent follower and return its own bigram count
    # (not the total number of bigrams that start with w1).
    best = max(following_words, key=following_words.get, default=None)
    return best, following_words.get(best, 0)

w1 = input("Enter a word: ")
w2, count = most_probable_next_word(text, w1)

if w2:
    print(f"'{w2}' is most likely to follow '{w1}' with frequency {count}.")
else:
    print(f"No words found after '{w1}'.")

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Enter a word: language
'processing' is most likely to follow 'language' with frequency 1.


In [17]:
def most_frequent_follower(text, w1):
    """Loop-based version of most_probable_next_word."""
    words = word_tokenize(text)
    bigrams = Counter(nltk.bigrams(words))
    # Count every token that directly follows w1
    following_words = {}
    for (prev, w2), count in bigrams.items():
        if prev == w1:
            following_words[w2] = count
    # Keep the follower with the highest count
    most_freq_word = None
    max_count = 0
    for w, c in following_words.items():
        if c > max_count:
            max_count = c
            most_freq_word = w
    # Return the winner's own count, not the sum over all followers
    return most_freq_word, max_count
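
Raw counts can be turned into a maximum-likelihood estimate of the conditional probability P(w2 | w1) = count(w1, w2) / count(w1, *), where count(w1, *) is the number of bigrams that start with w1. The sketch below assumes pre-tokenized input and uses only the standard library; `next_word_probabilities` is a hypothetical name.

```python
from collections import Counter

def next_word_probabilities(tokens, w1):
    """Return {w2: P(w2 | w1)} estimated from adjacent token pairs."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    followers = {w2: c for (prev, w2), c in bigram_counts.items() if prev == w1}
    total = sum(followers.values())
    return {w2: c / total for w2, c in followers.items()} if total else {}

tokens = "language processing helps and language models help and language processing scales".split()
probs = next_word_probabilities(tokens, "language")
print(probs)
```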

# Week 4

In [None]:
import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.collocations import BigramAssocMeasures
from nltk.corpus import stopwords

nltk.download('stopwords')

# Sample text
text = "Machine learning is a fascinating field of artificial intelligence. \
It allows computers to learn from data and make predictions. \
Deep learning, a subset of machine learning, focuses on neural networks."

# Tokenization
words = nltk.word_tokenize(text)

stop_words = set(stopwords.words('english'))
words = [word for word in words if word.lower() not in stop_words]

# Finding bigram collocations
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)

# Get top 5 collocations based on PMI
collocations = finder.nbest(bigram_measures.pmi, 5)
print(collocations)

[nltk_data] Downloading package stopwords to /root/nltk_data...


[('allows', 'computers'), ('artificial', 'intelligence'), ('computers', 'learn'), ('data', 'make'), ('fascinating', 'field')]


[nltk_data]   Unzipping corpora/stopwords.zip.
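
Pointwise mutual information compares how often two words occur together with how often they would co-occur if they were independent: PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ). The sketch below applies that textbook definition with simple count-based estimates. It is meant to illustrate the idea and is not guaranteed to match NLTK's internal computation term for term; `bigram_pmi` is a hypothetical name.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} using counts over `tokens` as probability estimates."""
    n = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {
        (w1, w2): math.log2(c * n / (word_counts[w1] * word_counts[w2]))
        for (w1, w2), c in pair_counts.items()
    }

tokens = ["machine", "learning", "deep", "learning", "machine", "learning", "rare", "pair"]
scores = bigram_pmi(tokens)
# Show the three highest-scoring pairs
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(score, 3))
```

Because the denominator shrinks for infrequent words, pairs made of words that each occur only once tend to score highest, which is consistent with the top results printed above.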


In [None]:
import nltk
from nltk.tokenize import word_tokenize

# Download necessary NLTK resources
nltk.download('punkt')

# Function to find words beginning with a given prefix
def find_words_starting_with(text, prefix):
    # Tokenize the lowercased text so that matching is case-insensitive
    words = word_tokenize(text.lower())
    prefix = prefix.lower()

    # Build a fresh list on every call so results do not accumulate across calls
    return [word for word in words if word.startswith(prefix)]

# Read the prefix from the user (for example, 'mac'); `text` comes from the previous cell
prefix = input("enter prefix:")
words_with_prefix = find_words_starting_with(text, prefix)

# Print the words
print(f"Words starting with '{prefix}':")
print(words_with_prefix)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


enter prefix:mac
Words starting with 'mac':
['machine', 'machine']


In [None]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
# `text` is reused from an earlier cell. To try a different sample, uncomment one of the assignments below.
# text = "Natural Language Processing (NLP) is an important field of artificial intelligence. NLP techniques are used to process human languages for various applications. Machine learning and deep learning models have greatly improved NLP capabilities, making it a powerful tool."
# text = input("enter text:")

def long_words(text, min_length=4):
    # Keep tokens with more than min_length characters
    words = word_tokenize(text)
    return [word for word in words if len(word) > min_length]

words = long_words(text)
print("Words longer than four characters:")
print(words)

Words longer than four characters:
['Natural', 'language', 'processing', 'field', 'artificial', 'intelligence', 'deals', 'interaction', 'between', 'computers', 'humans', 'using', 'natural', 'language', 'Processing', 'includes', 'tasks', 'tokenization', 'parsing', 'sentiment', 'analysis', 'Understanding', 'language', 'crucial', 'applications', 'chatbots', 'translation', 'information', 'retrieval', 'Language', 'processing', 'helps', 'deriving', 'meaning', 'essential', 'modern', 'systems']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Week 5

In [None]:
import re

def find_math_expressions(sentence):
    # Refined regular expression to match mathematical expressions
    math_expression_pattern = r'[A-Za-z\d]+(?:\s*[\+\-\/\^\=]\s[A-Za-z\d]+)+'

    # Find all matches in the sentence
    math_expressions = re.findall(math_expression_pattern, sentence)

    return math_expressions

# Example input
sentence = input("Enter a sentence: ")

# Identify mathematical expressions
math_expressions = find_math_expressions(sentence)

if math_expressions:
    print("Mathematical expressions found:", math_expressions)
else:
    print("No mathematical expressions found.")
#Enter a sentence: The area of a circle is given by the formula A = pi * r^2. Also, 3 + 5 = 8 is true.

Enter a sentence: The area of a circle is given by the formula A = pi * r^2. Also, 3 + 5 = 8 is true
Mathematical expressions found: ['A = pi', '3 + 5 = 8']
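
The run above returns `A = pi` rather than the full formula because the pattern has no `*` in its operator class and requires exactly one whitespace character after each operator, so `* r` and `^2` are not matched. A possible variant, offered as a sketch and not as a complete expression parser, adds `*` and makes the whitespace optional on both sides. One side effect of this assumption is that hyphenated words such as `well-known` would also match. `find_math_expressions_v2` is a hypothetical name.

```python
import re

# Operators: + - * / ^ =, with optional whitespace on either side (assumed variant)
MATH_PATTERN = re.compile(r'[A-Za-z\d]+(?:\s*[+\-*/^=]\s*[A-Za-z\d]+)+')

def find_math_expressions_v2(sentence):
    """Return every substring that looks like: operand (operator operand)+."""
    return MATH_PATTERN.findall(sentence)

sentence = "The area of a circle is given by the formula A = pi * r^2. Also, 3 + 5 = 8 is true."
print(find_math_expressions_v2(sentence))
```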


In [None]:
import re

def extract_email_components(email):
    # Regular expression to match an email address
    email_pattern = r'^([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\.([a-zA-Z]{2,})$'

    match = re.match(email_pattern, email)
    if match:
        local_part = match.group(1)
        domain = match.group(2)
        top_level_domain = match.group(3)
        return local_part, domain, top_level_domain
    else:
        return None

# Input from the user
email = input("Enter an email address: ")

# Extract components
components = extract_email_components(email)

if components:
    print(f"Local part: {components[0]}")
    print(f"Domain: {components[1]}")
    print(f"Top-level domain: {components[2]}")
else:
    print("Invalid email address format.")

Enter an email address: 322103382058@gvpce.ac.in
Local part: 322103382058
Domain: gvpce.ac
Top-level domain: in
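
Because the domain group is greedy, everything up to the last dot is captured as the domain and only the final label becomes the top-level domain, which is why `gvpce.ac.in` splits into `gvpce.ac` and `in`. The sketch below restates the same pattern with named groups and `re.fullmatch`, and additionally splits the domain into its dot-separated labels. The address used is a placeholder, and `parse_email` is a hypothetical name.

```python
import re

EMAIL_PATTERN = re.compile(
    r'(?P<local>[a-zA-Z0-9._%+-]+)@(?P<domain>[a-zA-Z0-9.-]+)\.(?P<tld>[a-zA-Z]{2,})'
)

def parse_email(email):
    """Return a dict of email components, or None if the format does not match."""
    match = EMAIL_PATTERN.fullmatch(email)
    if match is None:
        return None
    parts = match.groupdict()
    parts["labels"] = parts["domain"].split(".")  # dot-separated domain labels
    return parts

print(parse_email("user@example.co.uk"))
print(parse_email("not-an-email"))
```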


# Week 6

In [None]:
import nltk
from nltk.corpus import wordnet as wn

# Download necessary resources
nltk.download('wordnet')
nltk.download('omw-1.4')

def get_synonyms_antonyms(word):
    # Get synsets (synonym sets) for the word
    synsets = wn.synsets(word)

    synonyms = set()
    antonyms = set()

    for synset in synsets:
        # Add synonyms to the set
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())

            # Add antonyms to the set
            if lemma.antonyms():
                antonyms.add(lemma.antonyms()[0].name())

    return list(synonyms), list(antonyms)

# Input word from user
word = input("Enter a word: ")

# Get synonyms and antonyms
synonyms, antonyms = get_synonyms_antonyms(word)

if synonyms:
    print(f"Synonyms of {word}: {', '.join(synonyms)}")
else:
    print(f"No synonyms found for {word}.")

if antonyms:
    print(f"Antonyms of {word}: {', '.join(antonyms)}")
else:
    print(f"No antonyms found for {word}.")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Enter a word: natural
Synonyms of natural: cancel, instinctive, lifelike, rude, natural, raw, innate, born
Antonyms of natural: supernatural, sharp, artificial, unnatural


In [None]:
import nltk
from nltk.corpus import wordnet as wn

# Download necessary resources
nltk.download('wordnet')
nltk.download('omw-1.4')

def get_hyponyms(word):
    synsets = wn.synsets(word)
    hyponyms = set()
    for synset in synsets:
        for hyponym in synset.hyponyms():
            hyponyms.add(hyponym.name().split('.')[0])  # Get the word part of the hyponym
    return list(hyponyms)

def get_homonyms(word):
    synsets = wn.synsets(word)
    homonyms = set()
    for synset in synsets:
        # WordNet has no explicit homonym relation; as a rough proxy, collect
        # the head lemma of every synset (sense) that contains the word
        homonyms.add(synset.name().split('.')[0])  # Head lemma of the synset
    return list(homonyms)

def get_polysemy(word):
    synsets = wn.synsets(word)
    return len(synsets)

# Main function to execute the program
def main():
    word = input("Enter a word: ")

    # Get hyponyms
    hyponyms = get_hyponyms(word)
    if hyponyms:
        print(f"Hyponyms of {word}: {', '.join(hyponyms)}")
    else:
        print(f"No hyponyms found for {word}.")

    # Get homonyms
    homonyms = get_homonyms(word)
    if len(homonyms) > 1:
        print(f"Homonyms of {word}: {', '.join(homonyms)}")
    else:
        print(f"No homonyms found for {word}.")

    # Get polysemy count
    polysemy_count = get_polysemy(word)
    print(f"{word} has {polysemy_count} meanings (Polysemy count).")

# Call the main function directly
main()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Enter a word: is
Hyponyms of is: begin, cox, stand, connect, incarnate, press, trim, distribute, deserve, rage, attend, jumble, iridesce, osculate, follow, be_well, coexist, form, buy, present, stink, fall, extend, head, confuse, flow, cost, clean, seethe, retard, supplement, stick, kill, lie, vet, mope, promise, stretch, belong, look, put_out, prove, impend, rate, rank, inhabit, sell, reach, consist, total, pay, cover, object, prevail, suit, compact, sit, come_in_handy, want, compare, sparkle, center_on, straddle, contain, bake, deck, owe, run_into, transplant, go, beat, equate, cut_across, loiter, compose, answer, suck, squat, underlie, end, tend, figure, recognize, dwell, feel, range, let_go, rut, cut, represent, breathe, point, test, preexist, litter, relate, kick_around, populate, fit, face, hoodoo, exemplify, come_in_for, set_back, disagree, stagnate, hail, hang, specify, account_for, suffer, endanger, remain, body, act, seem, work, match, stand_by, swing, wind, subtend, account,