In [None]:
# This notebook demonstrates fundamental Natural Language Processing (NLP) tasks using popular Python libraries.
# Each exercise focuses on one NLP technique:
# 1. Tokenization with NLTK
# 2. Named Entity Recognition (NER) with SpaCy
# 3. Sentiment Analysis with TextBlob
# 4. Text Summarization with Sumy


### Explanation of NLP Concepts:

1.  **Tokenization** with NLTK: The process of breaking down a text into smaller units called tokens, which can be words, subwords, or characters. NLTK provides tools like `word_tokenize` for this task.

2.  **Named Entity Recognition (NER)** with SpaCy: A technique that identifies named entities in text and classifies them into predefined categories such as person names, organizations, locations, time expressions, quantities, and monetary values. SpaCy is a library widely used for this purpose.

3.  **Sentiment Analysis** with TextBlob: The computational study of opinions, sentiments, and emotions expressed in text. It determines the emotional tone behind a piece of text, often classifying it as positive, negative, or neutral. TextBlob offers a simple API for sentiment analysis.

4.  **Text Summarization** with Sumy: The process of condensing a longer text into a shorter, coherent version while retaining the most important information and overall meaning of the original text. Sumy is a Python library that provides several summarization algorithms, such as LSA (Latent Semantic Analysis). These algorithms are extractive: they select sentences from the original text rather than writing new ones.

## Import Libraries and Download Resources

In [None]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt') # Punkt sentence tokenizer data used by word_tokenize
nltk.download('punkt_tab') # Punkt data in the format that newer NLTK releases look for (Sumy's Tokenizer relies on it too)

import spacy

from textblob import TextBlob

## Install `sumy` library

In [None]:
# Install sumy if not already installed (%pip installs into the environment of the running kernel)
%pip install sumy

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
nltk.download('stopwords') # Optional: NLTK's stopword corpus. Sumy ships its own lists (sumy.utils.get_stop_words), and the LSA example below runs without stop words

## Exercise 1: Tokenization with NLTK

**Task**: Break down a given text into individual words or tokens.
**Library**: NLTK (Natural Language Toolkit) is a powerful library for working with human language data.

In [None]:
print("\n--- Exercise 1: Tokenization with NLTK ---")
text1 = "Natural Language Processing enables computers to understand human language."
tokens = word_tokenize(text1) # Uses NLTK's word_tokenize function to split the text
print(f"Original text: '{text1}'")
print(f"Tokens: {tokens}")

### Challenge 1: Tokenization

**Task**: Experiment with different NLTK tokenizers. For example, try `wordpunct_tokenize` or `TreebankWordTokenizer` (after importing if necessary) and observe the differences in tokenization for the given text or a new sentence of your choice. Pay attention to how punctuation is handled.

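**Hint**: Both can be imported from `nltk.tokenize`. `wordpunct_tokenize` is a function you call directly, while `TreebankWordTokenizer` is a class, so create an instance first: `TreebankWordTokenizer().tokenize(text)`.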
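
To make the punctuation handling concrete before running the NLTK tokenizers, here is a minimal sketch that uses only the standard library. It assumes the regular expression `\w+|[^\w\s]+`, which is the pattern NLTK documents for `wordpunct_tokenize`; the sample sentence and the variable names are made up for illustration.

```python
import re

text = "Don't hesitate, it's $3.50!"

# Whitespace splitting leaves punctuation glued to the neighbouring word.
whitespace_tokens = text.split()

# Assumed equivalent of wordpunct_tokenize: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
wordpunct_like_tokens = re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_tokens)
print(wordpunct_like_tokens)
```

Notice that the regex variant splits the contraction and the price into several pieces. Comparing this with the output of `word_tokenize` and `TreebankWordTokenizer` on the same sentence shows how much tokenizers can disagree on punctuation.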

## Exercise 2: Named Entity Recognition with SpaCy

**Task**: Identify and classify named entities (like persons, organizations, locations) in text.
**Library**: SpaCy is an industrial-strength natural language processing library in Python.

In [None]:
print("\n--- Exercise 2: Named Entity Recognition with SpaCy ---")
# Load SpaCy model - ensure 'en_core_web_sm' is downloaded (you might need !python -m spacy download en_core_web_sm)
try:
    nlp = spacy.load("en_core_web_sm") # Loads a small English model for processing
except OSError:
    print("Downloading en_core_web_sm model for SpaCy...")
    from spacy.cli import download
    download("en_core_web_sm") # If model is not found, download it automatically
    nlp = spacy.load("en_core_web_sm")

text2 = "Google was founded by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University."
doc = nlp(text2) # Process the text with the loaded SpaCy model
print(f"Original text: '{text2}'")
print("Named Entities:")
for ent in doc.ents: # Iterate through the detected entities
    print(f"  {ent.text:<20} {ent.label_}") # Print the entity text and its label (e.g., PERSON, ORG)

### Challenge 2: Named Entity Recognition

**Task**: Apply NER to a new sentence that contains different types of entities (e.g., dates, monetary values, products). Analyze the output and see if SpaCy correctly identifies and labels them. What happens if you use a sentence with less common entities?

**Example Sentence**: "Apple released the iPhone 15 in September 2023 for $799. The event took place in Cupertino, California."
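
One way to analyze the output systematically is to write down the entities you expect and compare them with what the model returns. The sketch below is a hypothetical helper, not part of SpaCy. The `predicted` list is a hand-written stand-in for `[(ent.text, ent.label_) for ent in doc.ents]` and is not real model output; the expected labels are one reader's guess using SpaCy's label names.

```python
def compare_entities(predicted, expected):
    """Split (text, label) pairs into matched, missed, and unexpected sets."""
    predicted, expected = set(predicted), set(expected)
    return {
        "matched": predicted & expected,      # found with the expected label
        "missed": expected - predicted,       # expected but not returned
        "unexpected": predicted - expected,   # returned but not expected
    }

# Entities you expect for the example sentence (your own judgement).
expected = [
    ("Apple", "ORG"), ("iPhone 15", "PRODUCT"), ("September 2023", "DATE"),
    ("799", "MONEY"), ("Cupertino", "GPE"), ("California", "GPE"),
]

# Hand-written stand-in for the model's answer, NOT real SpaCy output.
predicted = [
    ("Apple", "ORG"), ("September 2023", "DATE"),
    ("799", "MONEY"), ("Cupertino", "GPE"), ("California", "GPE"),
]

report = compare_entities(predicted, expected)
print("missed:", sorted(report["missed"]))
print("unexpected:", sorted(report["unexpected"]))
```

In the notebook, replace `predicted` with the pairs collected from `doc.ents` after running `nlp` on the sentence.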

## Exercise 3: Sentiment Analysis with TextBlob

**Task**: Determine the emotional tone behind a piece of text, usually categorizing it as positive, negative, or neutral.
**Library**: TextBlob is a Python library for processing textual data that offers a simple API for common NLP tasks.

In [None]:
print("\n--- Exercise 3: Sentiment Analysis with TextBlob ---")
text3 = "I am extremely happy with the service provided."
blob = TextBlob(text3) # Create a TextBlob object from the text
sentiment = blob.sentiment # Access the sentiment property, which returns polarity and subjectivity
print(f"Original text: '{text3}'")
# Polarity ranges from -1 (negative) to 1 (positive); subjectivity ranges from 0 (objective) to 1 (subjective)
print(f"Sentiment: {sentiment}")

### Challenge 3: Sentiment Analysis

**Task**: Choose a short paragraph from a product review or a news article. Perform sentiment analysis using TextBlob. Analyze the `polarity` and `subjectivity` scores. How well does it align with your own understanding of the text's sentiment? Try sentences with sarcasm or nuanced language and observe the results.
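
TextBlob reports a numeric polarity rather than a positive, negative, or neutral label, so a small mapping can make the comparison with your own reading easier. The helper below is a minimal sketch; the function name and the `threshold` of 0.1 are assumptions chosen for illustration, not TextBlob defaults.

```python
def label_polarity(polarity, threshold=0.1):
    """Map a polarity score in [-1, 1] to a coarse sentiment label.

    Scores within +/- threshold of zero count as neutral.
    The threshold is an arbitrary choice for this sketch.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

# Made-up scores to show the three outcomes.
for score in (0.8, 0.05, -0.4):
    print(f"{score:+.2f} -> {label_polarity(score)}")
```

In the notebook you could call `label_polarity(TextBlob(text3).sentiment.polarity)`. Sarcastic sentences are a good test of where a fixed threshold on polarity stops matching human judgement.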

## Exercise 4: Text Summarization with Sumy

**Task**: Condense a longer text into a shorter version while retaining the most important information.
**Library**: Sumy is a Python library for automatic summarization of text documents.

In [None]:
print("\n--- Exercise 4: Text Summarization with Sumy ---")
text4 = "Natural Language Processing (NLP) is a fascinating field at the intersection of computer science, artificial intelligence, and linguistics. It enables machines to understand, interpret, and generate human language, opening up a world of possibilities for applications ranging from chatbots and translation services to sentiment analysis and beyond. This field involves various techniques, including machine learning, deep learning, and rule-based methods, to process and analyze large amounts of text data. The goal of NLP is to bridge the communication gap between humans and computers, allowing for more natural and intuitive interactions. Its applications are constantly expanding, making it a critical area of research and development in today's technologically driven world."
parser = PlaintextParser.from_string(text4, Tokenizer("english")) # Parse the text using Sumy's PlaintextParser and English tokenizer
summarizer = LsaSummarizer() # Initialize an LSA (Latent Semantic Analysis) summarizer
summary = summarizer(parser.document, 2)  # Summarize the document into 2 sentences
print(f"Original text (first 100 chars): '{text4[:100]}...'")
print("Summary (2 sentences):")
for sentence in summary:
    print(f"  - {sentence}") # Print each sentence of the generated summary

### Challenge 4: Text Summarization

**Task**: Take a longer article or a portion of text (e.g., from Wikipedia) and apply Sumy's LSA summarizer. Experiment with different numbers of sentences for the summary (e.g., 3, 5). Compare the generated summaries to see which one best captures the essence of the original text while remaining concise. You can also explore other summarization algorithms available in Sumy if you're feeling adventurous (e.g., `LexRankSummarizer`).
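
To build intuition for what an extractive summarizer does when you change the sentence count, here is a minimal standard-library sketch. It is a simple word-frequency heuristic, not LSA or LexRank and not how Sumy is implemented; the stopword set, the naive sentence splitter, the sample text, and all function names are assumptions made for this example.

```python
import re
from collections import Counter

# Small hand-picked stopword set, an assumption for this sketch only.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "it", "on", "for"}

def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]

def frequency_summary(text, sentence_count):
    """Return the highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    freqs = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentence_count])  # restore original order
    return [sentences[i] for i in keep]

sample = (
    "Cats sleep a lot. Cats also hunt mice. "
    "The weather was mild. Mice fear cats."
)
for n in (1, 2):
    print(n, frequency_summary(sample, n))
```

Raising the sentence count only adds lower-ranked sentences to the ones already chosen, which is a useful property to check when you compare Sumy's 3-sentence and 5-sentence summaries.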