## Exercise 1: Part-of-Speech Tagging with NLTK

**Task**: Identify the grammatical category (Part-of-Speech) for each word in a given sentence.
**Library**: NLTK (Natural Language Toolkit) is widely used for linguistic annotation tasks.

In [2]:
import nltk
nltk.download('averaged_perceptron_tagger_eng') # Required for POS tagging for English
from nltk.tokenize import word_tokenize

print("\n--- Exercise 1: Part-of-Speech Tagging with NLTK ---")
text5 = "The quick brown fox jumps over the lazy dog."
tokens5 = word_tokenize(text5)
pos_tags = nltk.pos_tag(tokens5) # Perform POS tagging on the tokens
print(f"Original text: '{text5}'")
print(f"Tokens: {tokens5}")
print(f"POS Tags: {pos_tags}")
# identify the grammatical categories of each word in the sentence
for word, tag in pos_tags:
    print(f"Word: {word}, POS Tag: {tag}")


--- Exercise 1: Part-of-Speech Tagging with NLTK ---
Original text: 'The quick brown fox jumps over the lazy dog.'
Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
Word: The, POS Tag: DT
Word: quick, POS Tag: JJ
Word: brown, POS Tag: NN
Word: fox, POS Tag: NN
Word: jumps, POS Tag: VBZ
Word: over, POS Tag: IN
Word: the, POS Tag: DT
Word: lazy, POS Tag: JJ
Word: dog, POS Tag: NN
Word: ., POS Tag: .


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\olive\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


The abbreviations like DT, JJ, NN, etc., are Part-of-Speech (POS) tags used by NLTK. They represent the grammatical category of each word. Here's a breakdown of the ones you've seen and some common ones:

- DT: Determiner (e.g., "The", "a", "an")
- JJ: Adjective (e.g., "quick", "brown", "lazy", "new", "brilliant", "few", "slow")
- NN: Noun, singular or mass (e.g., "fox", "dog", "movie", "acting", "storyline", "edge", "seat", "bit")
- VBZ: Verb, 3rd person singular present (e.g., "jumps", "is")
- IN: Preposition or subordinating conjunction (e.g., "over", "on", "of", "though")
- .: Punctuation mark, sentence closer (e.g., ".", "!")
- RB: Adverb (e.g., "absolutely", "highly")
- VBD: Verb, past tense (e.g., "was", "kept", "were")
- VBN: Verb, past participle (e.g., "superb" - often tagged as VBN when acting as an adjective derived from a verb)
- PRP: Personal pronoun (e.g., "me", "I", "it")
- PRP$: Possessive pronoun (e.g., "my")
- VBP: Verb, non-3rd person singular present (e.g., "recommend")
- CC: Coordinating conjunction (e.g., "and")
- NNS: Noun, plural (e.g., "scenes")

These tags help in understanding the grammatical structure of sentences, which is foundational for many advanced NLP tasks!

## Exercise 2: Dependency Parsing with SpaCy

**Task**: Analyze the grammatical structure of a sentence by identifying the head word for each word and the type of dependency relation between them.
**Library**: SpaCy provides efficient and accurate dependency parsing capabilities.

In [3]:
import spacy

print("\n--- Exercise 2: Dependency Parsing with SpaCy ---")
# Load SpaCy model - ensure 'en_core_web_sm' is downloaded
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    print("Downloading en_core_web_sm model for SpaCy...")
    from spacy.cli import download
    download("en_core_web_sm")
    nlp = spacy.load("en_core_web_sm")

text7 = "Apple is looking at buying U.K. startup for $1 billion."
doc = nlp(text7) # Process the text with the loaded SpaCy model

print(f"Original text: '{text7}'")
print("Dependency Parse:")
for token in doc:
    # Print the token, its part-of-speech, its head word, and the dependency relation
    print(f"  {token.text:<10} {token.pos_:<10} {token.dep_:<15} {token.head.text:<10}")

# You can also visualize the dependency tree
# For visualization, you might need to install 'displacy'
# import displacy
# displacy.render(doc, style="dep", jupyter=True, options={'compact': True, 'distance': 90})



--- Exercise 2: Dependency Parsing with SpaCy ---
Original text: 'Apple is looking at buying U.K. startup for $1 billion.'
Dependency Parse:
  Apple      PROPN      nsubj           looking   
  is         AUX        aux             looking   
  looking    VERB       ROOT            looking   
  at         ADP        prep            looking   
  buying     VERB       pcomp           at        
  U.K.       PROPN      nsubj           startup   
  startup    VERB       ccomp           buying    
  for        ADP        prep            startup   
  $          SYM        quantmod        billion   
  1          NUM        compound        billion   
  billion    NUM        pobj            for       
  .          PUNCT      punct           looking   


## Exercise 3: Text Classification with scikit-learn

**Task**: Categorize text into predefined classes (e.g., positive/negative sentiment).
**Library**: `scikit-learn` is a powerful and widely used machine learning library in Python.

In [4]:
import nltk
nltk.download('stopwords') # Often useful for text preprocessing in classification

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

print("\n--- Exercise 3: Text Classification with scikit-learn ---")

# Sample Dataset (simple movie review sentiments)
texts = [
    "This movie is fantastic and I love it!",
    "What a terrible film, absolutely dreadful.",
    "The acting was good, but the plot was boring.",
    "A truly amazing experience, highly recommended.",
    "I hated every minute of this movie, so bad.",
    "It was an okay film, nothing special."
]
labels = ['positive', 'negative', 'neutral', 'positive', 'negative', 'neutral'] # Corresponding labels

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3, random_state=42)

# Create a pipeline: Vectorizer -> Classifier
# CountVectorizer converts text into a matrix of token counts
# MultinomialNB (Naive Bayes) is a simple, yet effective classifier for text
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])

# Train the classifier
text_clf.fit(X_train, y_train)

# Make predictions on the test set
predicted = text_clf.predict(X_test)

print(f"Original text for prediction: '{X_test}'")
print(f"Actual labels: {y_test}")
print(f"Predicted labels: {predicted}")
print("\nClassification Report:")
print(classification_report(y_test, predicted, zero_division=0)) # Print classification metrics

# Predict on a new unseen sentence
new_sentence = "I really enjoyed this production, very entertaining!"
predicted_sentiment = text_clf.predict([new_sentence])
print(f"\nNew sentence: '{new_sentence}'")
print(f"Predicted sentiment: {predicted_sentiment[0]}")


--- Exercise 3: Text Classification with scikit-learn ---
Original text for prediction: '['This movie is fantastic and I love it!', 'What a terrible film, absolutely dreadful.']'
Actual labels: ['positive', 'negative']
Predicted labels: ['negative' 'neutral']

Classification Report:
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00       1.0
     neutral       0.00      0.00      0.00       0.0
    positive       0.00      0.00      0.00       1.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0


New sentence: 'I really enjoyed this production, very entertaining!'
Predicted sentiment: negative


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\olive\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Challenge: Integrated NLP Pipeline

**Task**: Take a new paragraph of text, perform several NLP steps on it, and then classify its sentiment. This exercise integrates concepts from Part-of-Speech Tagging, Dependency Parsing, and Text Classification.

**Steps:**
1.  **Choose a new paragraph of text.**
2.  **Perform Tokenization and Part-of-Speech Tagging** using NLTK (similar to Exercise 1).
3.  **Perform Dependency Parsing** using SpaCy to analyze the grammatical structure (similar to Exercise 2).
4.  **Classify the sentiment** of the paragraph using the pre-trained `text_clf` model from Exercise 3. You might need to consider how to apply a sentence-level classifier to a paragraph.

In [7]:
import nltk
import spacy
nltk.download('punkt') # Required for tokenization
nlp = spacy.load("en_core_web_sm") # Load SpaCy model
from textblob import TextBlob


paragraph = "Films are produced by recording actual people and objects with cameras or by creating them using animation techniques and special effects. They comprise a series of individual frames, but when these images are shown rapidly in succession, the illusion of motion is given to the viewer. Flickering between frames is not seen due to an effect known as persistence of vision, whereby the eye retains a visual image for a fraction of a second after the source has been removed. Also of relevance is what causes the perception of motion; a psychological effect identified as beta movement. "

# tokenisation and part-of-speech tagging
tokens = nltk.word_tokenize(paragraph)
pos_tags = nltk.pos_tag(tokens)
print(f"Original text: '{text5}'")
print(f"Tokens: {tokens5}")
print(f"POS Tags: {pos_tags}")

# dependency parsing
doc = nlp(paragraph)
print("Dependency Parse:")
for token in doc:
    print(f"  {token.text:<10} {token.pos_:<10} {token.dep_:<15} {token.head.text:<10}")
    


# sentiment of the paragraph
blob = TextBlob(paragraph)
sentiment = blob.sentiment
print("\nSentiment Analysis:")
print(f"Polarity: {sentiment.polarity}, Subjectivity: {sentiment.subjectivity}")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\olive\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Original text: 'The quick brown fox jumps over the lazy dog.'
Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
POS Tags: [('Films', 'NNS'), ('are', 'VBP'), ('produced', 'VBN'), ('by', 'IN'), ('recording', 'VBG'), ('actual', 'JJ'), ('people', 'NNS'), ('and', 'CC'), ('objects', 'NNS'), ('with', 'IN'), ('cameras', 'NNS'), ('or', 'CC'), ('by', 'IN'), ('creating', 'VBG'), ('them', 'PRP'), ('using', 'VBG'), ('animation', 'NN'), ('techniques', 'NNS'), ('and', 'CC'), ('special', 'JJ'), ('effects', 'NNS'), ('.', '.'), ('They', 'PRP'), ('comprise', 'VBP'), ('a', 'DT'), ('series', 'NN'), ('of', 'IN'), ('individual', 'JJ'), ('frames', 'NNS'), (',', ','), ('but', 'CC'), ('when', 'WRB'), ('these', 'DT'), ('images', 'NNS'), ('are', 'VBP'), ('shown', 'VBN'), ('rapidly', 'RB'), ('in', 'IN'), ('succession', 'NN'), (',', ','), ('the', 'DT'), ('illusion', 'NN'), ('of', 'IN'), ('motion', 'NN'), ('is', 'VBZ'), ('given', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('viewer', 'NN'