1. Write a Python program that reads a paragraph and:  
    a. tokenizes the text into words;  
    b. removes punctuation and converts all words to lowercase;  
    c. performs stemming and lemmatization.

In [None]:
import re
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:

def tokenize(para):
    # a. tokenizes the text into words;
    tokens = nltk.word_tokenize(para)

    # b. removes punctuation and converts all words to lowercase
    #    (isalpha() also drops numeric tokens such as '7500')
    lowered = [i.lower() for i in tokens if i.isalpha()]

    # c. performs stemming and lemmatization.
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(i) for i in lowered]

    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(i) for i in lowered]

    return tokens, lowered, stemmed_tokens, lemmatized_tokens

In [None]:
paragraph = input("Please enter a paragraph: ")

tokens, lowered, stemmed_words, lemmatized_words = tokenize(paragraph)

print("Tokens: ", tokens)
print("Lower-case after removing punctuations: ", lowered)
print("Stemmed words:", stemmed_words)
print("Lemmatized words:", lemmatized_words)

Please enter a paragraph: The cat (Felis catus), also referred to as the domestic cat or house cat, is a small domesticated carnivorous mammal. It is the only domesticated species of the family Felidae. Advances in archaeology and genetics have shown that the domestication of the cat occurred in the Near East around 7500 BC. It is commonly kept as a pet and working cat, but also ranges freely as a feral cat avoiding human contact. It is valued by humans for companionship and its ability to kill vermin. Its retractable claws are adapted to killing small prey species such as mice and rats. It has a strong, flexible body, quick reflexes, and sharp teeth, and its night vision and sense of smell are well developed. It is a social species, but a solitary hunter and a crepuscular predator.
Tokens:  ['The', 'cat', '(', 'Felis', 'catus', ')', ',', 'also', 'referred', 'to', 'as', 'the', 'domestic', 'cat', 'or', 'house', 'cat', ',', 'is', 'a', 'small', 'domesticated', 'carnivorous', 'mammal', '.'
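
Step (b) keeps only tokens for which `str.isalpha()` is true. That removes punctuation, but it also drops numbers such as `7500` and any token containing a hyphen. The sketch below compares that filter with one possible alternative that deletes punctuation characters and keeps digits. It uses a plain whitespace split instead of `nltk.word_tokenize`, and the helper names `isalpha_filter` and `strip_punctuation` are hypothetical, introduced only for this illustration.

```python
import string

def isalpha_filter(tokens):
    """Same filter as step (b): keep purely alphabetic tokens, lowercased."""
    return [t.lower() for t in tokens if t.isalpha()]

def strip_punctuation(tokens):
    """Hypothetical alternative: delete ASCII punctuation characters, keep digits."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = (t.translate(table).lower() for t in tokens)
    return [t for t in cleaned if t]  # drop tokens that were pure punctuation

sample = "around 7500 BC , a well-known cat .".split()
print(isalpha_filter(sample))     # ['around', 'bc', 'a', 'cat']
print(strip_punctuation(sample))  # ['around', '7500', 'bc', 'a', 'wellknown', 'cat']
```

Note that the alternative merges `well-known` into `wellknown`; which behaviour is preferable depends on the downstream task.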

2. Extract digits, phone numbers, and email IDs from a given sentence.

In [None]:
import re

def extract_info(text):

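    # Extract every individual digit character
    digits = re.findall(r'\d', text)

    # Phone numbers: 3-3-4 digit groups with optional '-', '.' or space separators,
    # or an unbroken run of 10 or 11 digits
    phone_numbers = re.findall(r'\b(?:\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\d{10,11})\b', text)

    # Email addresses (this pattern only matches addresses at gmail.com)
    gmail_addresses = re.findall(r'\b[a-zA-Z0-9._%+-]+@gmail\.com\b', text)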

    return {
        "digits": digits,
        "phone_numbers": phone_numbers,
        "gmail_addresses": gmail_addresses
    }

text = "My phone number is 4867473357, or you can call me at 8143586475. My email is sillycat123@gmail.com. Another email is hehethegoat@gmail.com."

extracted_data = extract_info(text)

print("Extracted Digits:", extracted_data["digits"])
print("Extracted Phone Numbers:", extracted_data["phone_numbers"])
print("Extracted Gmail Addresses:", extracted_data["gmail_addresses"])

Extracted Digits: ['4', '8', '6', '7', '4', '7', '3', '3', '5', '7', '8', '1', '4', '3', '5', '8', '6', '4', '7', '5', '1', '2', '3']
Extracted Phone Numbers: ['4867473357', '8143586475']
Extracted Gmail Addresses: ['sillycat123@gmail.com', 'hehethegoat@gmail.com']
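
The function above returns single digit characters (including those inside email addresses) and only recognizes addresses at `gmail.com`. As a rough sketch of a variation, the code below extracts whole numbers and addresses at any domain. The names `EMAIL_RE`, `NUMBER_RE`, and `extract_numbers_and_emails` are hypothetical, and the email pattern is a deliberate simplification, not a full implementation of the address syntax.

```python
import re

# Simplified email pattern (an assumption for illustration; real address
# syntax is broader than this).
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
# Whole runs of digits that are not glued to letters.
NUMBER_RE = re.compile(r"\b\d+\b")

def extract_numbers_and_emails(text):
    emails = EMAIL_RE.findall(text)
    # Blank out the emails first so digits inside addresses are not reported.
    remainder = EMAIL_RE.sub(" ", text)
    numbers = NUMBER_RE.findall(remainder)
    return {"numbers": numbers, "emails": emails}

sample = "Call 4867473357 or write to sillycat123@gmail.com or admin@example.org by 5 pm."
result = extract_numbers_and_emails(sample)
print(result["numbers"])  # ['4867473357', '5']
print(result["emails"])   # ['sillycat123@gmail.com', 'admin@example.org']
```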


3. Implement a simple rule-based tokenizer for the English language using regular expressions. The tokenizer should treat punctuation and special symbols as separate tokens. Contractions like 'isn't' should be split into two tokens, is and n't. It should also keep abbreviations (e.g. USA) and words with internal *hyphenation* (e.g. ice-cream) as single tokens.


In [None]:
import re

def rule_based_tokenizer(text):
    # Alternatives are tried left to right, so the more specific rules come first.
    pattern = re.compile(
        r"[a-zA-Z]+n't"                  # contractions like isn't (split below)
        r"|[a-zA-Z]+(?:-[a-zA-Z]+)+"     # internally hyphenated words (ice-cream)
        r"|[A-Z]{2,}"                    # abbreviations (USA)
        r"|\$[\d.]+"                     # money ($4.50)
        r"|\w+"                          # ordinary words and numbers
        r"|\S"                           # punctuation and other symbols, one at a time
    )

    tokens = [match.group(0) for match in pattern.finditer(text)]

    # Split contractions: isn't → is, n't
    contraction_pattern = re.compile(r"([a-zA-Z]+)(n't)")
    final_tokens = []
    for token in tokens:
        match = contraction_pattern.fullmatch(token)
        if match:
            final_tokens.extend(match.groups())
        else:
            final_tokens.append(token)

    return final_tokens

text = "won't you tell me how ice-cream is a delicious thing..?"
# text = input("Enter Sentence: ")
tokens = rule_based_tokenizer(text)
print(tokens)


['wo', "n't", 'you', 'tell', 'me', 'how', 'ice-cream', 'is', 'a', 'delicious', 'thing', '.', '.', '?']
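
The tokenizer above splits only the `n't` contraction. One way it might be extended to other English clitics (`'s`, `'re`, `'ll`, `'ve`, `'d`, `'m`) is to do the splitting in a single pass with a lookahead. This is a sketch only: the clitic list is an assumption chosen for illustration, and `TOKEN_RE` and `tokenize_with_clitics` are hypothetical names.

```python
import re

# Alternatives are tried left to right; the lookahead rule must precede \w+.
TOKEN_RE = re.compile(
    r"[A-Za-z]+(?=n't\b)"          # the part before n't ("is" in "isn't")
    r"|n't\b"                      # the n't clitic itself
    r"|'(?:s|re|ll|ve|d|m)\b"      # other clitics (assumed list)
    r"|[A-Za-z]+(?:-[A-Za-z]+)+"   # internally hyphenated words
    r"|\w+"                        # ordinary words and numbers
    r"|\S"                         # any other single symbol
)

def tokenize_with_clitics(text):
    # The pattern has no capturing groups, so findall returns whole matches.
    return TOKEN_RE.findall(text)

sample = "She'll say it isn't well-known, won't she?"
print(tokenize_with_clitics(sample))
# ['She', "'ll", 'say', 'it', 'is', "n't", 'well-known', ',', 'wo', "n't", 'she', '?']
```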


4. Implement a text classifier for sentiment analysis using Naive Bayes.

In [7]:
import nltk
nltk.download('movie_reviews')
nltk.download('stopwords')

from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import string
import random

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

preprocessed_documents = []
for words, category in documents:
    # convert to lowercase and remove punctuation
    words = [word.lower() for word in words if word not in string.punctuation]

    # remove stop words and stem
    words = [stemmer.stem(word) for word in words if word not in stop_words]

    preprocessed_documents.append((words, category))

preprocessed_text = [doc[0] for doc in preprocessed_documents]
labels = [doc[1] for doc in preprocessed_documents]

print("Number of documents:", len(preprocessed_documents))
print("Example preprocessed document:", preprocessed_documents[0])

X_train, X_test, y_train, y_test = train_test_split(preprocessed_text, labels, test_size=0.2, random_state=42)

print("Number of samples in training set:", len(X_train))
print("Number of samples in testing set:", len(X_test))

vectorizer = TfidfVectorizer()

X_train_str = [' '.join(words) for words in X_train]
X_test_str = [' '.join(words) for words in X_test]

X_train_tfidf = vectorizer.fit_transform(X_train_str)

X_test_tfidf = vectorizer.transform(X_test_str)

print("Shape of X_train_tfidf:", X_train_tfidf.shape)
print("Shape of X_test_tfidf:", X_test_tfidf.shape)

nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_tfidf, y_train)
y_pred = nb_classifier.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision (weighted): {precision:.4f}")
print(f"Recall (weighted): {recall:.4f}")
print(f"F1-score (weighted): {f1:.4f}")

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Number of documents: 2000
Example preprocessed document: (['sam', 'matthew', 'broderick', 'astronom', 'small', 'american', 'town', 'engag', 'teacher', 'linda', 'kelli', 'preston', 'head', 'heel', 'love', 'linda', 'sudden', 'departur', 'new', 'york', 'citi', 'live', 'new', 'lover', 'anton', 'tch', 'ky', 'karyo', 'come', 'complet', 'surpris', 'know', 'love', 'blind', 'sam', 'leav', 'new', 'york', 'well', 'win', 'back', 'move', 'abandon', 'hous', 'across', 'street', 'anton', 'apart', 'instal', 'camera', 'obscura', 'watch', 'suddenli', 'anton', 'ex', 'maggi', 'show', 'want', 'former', 'lover', 'back', 'contrari', 'want', 'vapor', 'may', 'say', 'want', 'kill', 'possibl', 'bother', 'much', 'either', 'peopl', 'die', 'everi', 'day', 'spite', 'differ', 'motiv', 'two', 'team', 'fall', 'thing', 'move', 'stori', 'sound', 'like', 'someth', 'seen', 'million', 'time', 'griffin', 'dunn', 'bring', 'us', 'charm', 'comedi', 'sinc', 'sleep', 'camera', 'obscura', 'add', 'special', 'someth', 'movi', 'make',
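
To make the classification step less opaque, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It works on raw word counts rather than the TF-IDF weights used above, so it illustrates the idea rather than reproducing what `MultinomialNB` computes in this notebook. The toy documents and the function names `train_naive_bayes` and `predict` are made up for this example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (tokens, label). Returns class counts, word counts, vocabulary."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict(tokens, class_counts, word_counts, vocab):
    """Return the label with the highest log prior + summed log likelihoods."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                # add-one smoothing avoids log(0) for words unseen in this class
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up toy data, for illustration only.
toy_docs = [
    ("great fun movie".split(), "pos"),
    ("loved the great acting".split(), "pos"),
    ("boring plot".split(), "neg"),
    ("terrible boring movie".split(), "neg"),
]
model = train_naive_bayes(toy_docs)
print(predict("great acting".split(), *model))         # pos
print(predict("boring and terrible".split(), *model))  # neg
```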

Analyze an input sentence and check whether its sentiment is positive or negative.

In [21]:
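def preprocess_input(text):
    # Mirror the training preprocessing: lowercase, strip punctuation attached
    # to words (e.g. "great!"), drop stop words, then stem.
    words = [word.strip(string.punctuation).lower() for word in text.split()]
    words = [stemmer.stem(word) for word in words if word and word not in stop_words]
    return ' '.join(words)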

sentence = input("Enter a sentence to analyze its sentiment: ")

preprocessed_sentence = preprocess_input(sentence)

# vectorizing
sentence_tfidf = vectorizer.transform([preprocessed_sentence])

# predicting
predicted_sentiment = nb_classifier.predict(sentence_tfidf)

print(f"The sentiment of the sentence is: {predicted_sentiment[0]}")

Enter a sentence to analyze its sentiment: this movie was the last thing i wanted to see
The sentiment of the sentence is: neg
