# Lesson 5: Understanding Named Entity Recognition in NLP

## Introduction
Welcome to our lesson on Named Entity Recognition! Today, we'll be diving deep into the world of NLP and discovering how we can identify informative chunks of text, namely "Named Entities". The goal of this lesson is to learn about Part of Speech (POS) tagging and Named Entity Recognition (NER). By the end, you'll be able to gather specific types of data from text and get a few steps closer to mastering text classification.

## What is Named Entity Recognition?
Imagine we have a piece of text and we want to get some quick insights. What are the main subjects? Are there any specific locations or organizations being talked about? This is where Named Entity Recognition (NER) comes in handy.

In natural language processing (NLP), NER is a subtask of information extraction that locates named entities in text and classifies them into predefined categories such as persons, organizations, locations, time expressions, quantities, monetary values, and percentages.

For instance, consider the sentence: "Apple Inc. is planning to open a new store in San Francisco." Using NER, we could identify that "Apple Inc." is an organization and "San Francisco" is a location. Such information can be incredibly valuable for numerous NLP tasks.
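
One common way to store NER results is as labeled character spans. The sketch below runs no model: the two entities and their labels are written by hand to mirror the example sentence, and `EntitySpan` and `make_span` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class EntitySpan(NamedTuple):
    """Hypothetical container for one recognized entity."""
    text: str
    label: str
    start: int  # index of the first character in the sentence
    end: int    # index one past the last character


def make_span(sentence: str, text: str, label: str) -> EntitySpan:
    # Locate the entity text in the sentence; raises ValueError if it is absent.
    start = sentence.index(text)
    return EntitySpan(text, label, start, start + len(text))


sentence = "Apple Inc. is planning to open a new store in San Francisco."

# Labels are assigned by hand here, following the prose above.
entities = [
    make_span(sentence, "Apple Inc.", "ORGANIZATION"),
    make_span(sentence, "San Francisco", "LOCATION"),
]

for ent in entities:
    print(f"{ent.label:<13} {ent.text!r} at [{ent.start}:{ent.end}]")
```

Storing offsets alongside the label makes it possible to highlight or replace the entity in the original string later.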

## Part of Speech (POS) Tagging
Every word in a sentence plays a grammatical role: some words are nouns, some are verbs, some are adjectives, and so on. Tagging these parts of speech, or POS tagging, is a key component of many NLP tasks. It helps answer questions such as: what are the main objects in a sentence, what actions are being taken, and in what context?

Let's start with an example sentence: "Apple Inc. is planning to open a new store in San Francisco."

```python
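# Both functions rely on NLTK data packages (tokenizer and tagger models).
# If a LookupError appears, follow the nltk.download(...) hint in its message.
from nltk import pos_tag, word_tokenize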

example_sentence = "Apple Inc. is planning to open a new store in San Francisco."
tokens = word_tokenize(example_sentence)
pos_tags = pos_tag(tokens)
print(f'The first 5 POS tags are: {pos_tags[:5]}')
```

Output:
```
The first 5 POS tags are: [('Apple', 'NNP'), ('Inc.', 'NNP'), ('is', 'VBZ'), ('planning', 'VBG'), ('to', 'TO')]
```
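
Because `pos_tag` returns plain `(token, tag)` tuples, ordinary Python is enough to summarize them. The following is a small sketch that copies the five pairs from the output above by hand (so it runs without any NLTK models), counts the tags, and keeps the proper nouns.

```python
from collections import Counter

# First five (token, tag) pairs, copied from the output above.
pos_tags = [('Apple', 'NNP'), ('Inc.', 'NNP'), ('is', 'VBZ'),
            ('planning', 'VBG'), ('to', 'TO')]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in pos_tags)

# Keep only the tokens tagged as singular proper nouns (NNP).
proper_nouns = [token for token, tag in pos_tags if tag == 'NNP']

print(tag_counts)
print(proper_nouns)
```

Runs of adjacent proper nouns like these are often good candidates for named entities, which is what the next section builds on.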

## Named Entity Recognition with NLTK
Named Entity Recognition (NER) can be considered a step beyond regular POS tagging. It groups one or more words that form a named entity, such as "San Francisco", into a single chunk and assigns that chunk a category such as location or organization. Ideally, "Apple Inc." would likewise be grouped into one organization chunk.

```python
from nltk import ne_chunk

named_entities = ne_chunk(pos_tags)
print(f'The named entities in our example sentences are:\n{named_entities}')
```

Output:
```
The named entities in our example sentences are:
(S
  (PERSON Apple/NNP)
  (ORGANIZATION Inc./NNP)
  is/VBZ
  planning/VBG
  to/TO
  open/VB
  a/DT
  new/JJ
  store/NN
  in/IN
  (GPE San/NNP Francisco/NNP)
  ./.)
```

### Understanding the Output
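* 'S' is the label of the root node of the tree and stands for the whole sentence
* Words inside parentheses, prefixed with labels such as PERSON, ORGANIZATION, or GPE, are recognized named entities
* Words outside parentheses are part of the sentence but were not recognized as part of a named entity
* '(GPE San/NNP Francisco/NNP)' indicates that 'San Francisco' is recognized as a geopolitical entity
* In this output the chunker split "Apple Inc." into a PERSON ("Apple") and an ORGANIZATION ("Inc.") instead of one organization; NLTK's default chunker is a statistical model and makes mistakes like this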
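
The same tree can be flattened into IOB tags (B = beginning of an entity, I = inside, O = outside), a common NER representation; NLTK's `tree2conlltags` helper in `nltk.chunk` produces triples of this shape from a chunk tree. The sketch below does not call NLTK: the triples are written by hand to mirror the tree above, and `group_entities` is a hypothetical helper that merges them back into entities.

```python
# (word, POS tag, IOB tag) triples written by hand to mirror the tree above.
iob_triples = [
    ('Apple', 'NNP', 'B-PERSON'),
    ('Inc.', 'NNP', 'B-ORGANIZATION'),
    ('is', 'VBZ', 'O'),
    ('planning', 'VBG', 'O'),
    ('to', 'TO', 'O'),
    ('open', 'VB', 'O'),
    ('a', 'DT', 'O'),
    ('new', 'JJ', 'O'),
    ('store', 'NN', 'O'),
    ('in', 'IN', 'O'),
    ('San', 'NNP', 'B-GPE'),
    ('Francisco', 'NNP', 'I-GPE'),
    ('.', '.', 'O'),
]


def group_entities(triples):
    """Merge B-/I- tagged tokens into (entity text, label) pairs."""
    entities = []
    current_words, current_label = [], None
    for word, _pos, iob in triples:
        prefix, _, label = iob.partition('-')
        if prefix == 'I' and label == current_label:
            # Continuation of the entity currently being built.
            current_words.append(word)
            continue
        # Any other tag closes the entity in progress, if there is one.
        if current_words:
            entities.append((' '.join(current_words), current_label))
        if prefix in ('B', 'I'):
            # B- (or a stray I-) starts a new entity.
            current_words, current_label = [word], label
        else:
            current_words, current_label = [], None
    if current_words:
        entities.append((' '.join(current_words), current_label))
    return entities


print(group_entities(iob_triples))
```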

## Applying POS Tagging and NER to a Real Dataset
Let's use these techniques on the 20 Newsgroups dataset:

```python
from sklearn.datasets import fetch_20newsgroups
from nltk import pos_tag, ne_chunk, word_tokenize

# Loading the data with metadata removed
newsgroups_data = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# Selecting the first document 
first_doc = newsgroups_data.data[0]

# Trimming the document's text down to the first 67 characters
first_doc = first_doc[:67]

# Tokenizing the text
tokens_first_doc = word_tokenize(first_doc)

# Applying POS tagging
pos_tags_first_doc = pos_tag(tokens_first_doc)

# Applying Named Entity Recognition
named_entities = ne_chunk(pos_tags_first_doc)

print(f'The first chunk of named entities in the first document are:\n{named_entities}')
```
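
Slicing at a fixed character count, as in `first_doc[:67]`, can cut the last word in half and hand the tokenizer a fragment. As one possible alternative, the hypothetical helper below trims at the last whitespace before the limit. The sample string is made up for illustration and is not taken from the dataset.

```python
def trim_at_word_boundary(text: str, limit: int) -> str:
    """Return at most `limit` characters of `text` without splitting a word."""
    if limit <= 0:
        return ""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # The cut is clean if it ends on whitespace or is followed by whitespace.
    if cut[-1].isspace() or text[limit].isspace():
        return cut.rstrip()
    # Otherwise drop the partial word at the end of the slice.
    parts = cut.rsplit(None, 1)
    # A single long word cannot be trimmed cleanly, so keep the raw slice.
    return parts[0] if len(parts) == 2 else cut


# Made-up sample text, used only to demonstrate the helper.
sample = "Named entity recognition finds people, places and organizations."
print(repr(sample[:20]))
print(repr(trim_at_word_boundary(sample, 20)))
```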

## Lesson Summary
In this lesson, we have covered:
* Part of Speech (POS) tagging
* Named Entity Recognition (NER)
* Application of these techniques to real-world data
* Understanding and interpreting NER output

These concepts are fundamental to text preprocessing in Natural Language Processing. A solid grasp of them will let you approach more advanced NLP topics with confidence.

Keep practicing these techniques to reinforce your understanding and improve your Natural Language Processing skills. Enjoy the hands-on session and keep learning!

## Changing the Sentence for Named Entity Recognition

Great work, Space Voyager!

Now, change the sentence in the provided code to "Microsoft Corporation aims to reach 95% global coverage by 2025." and observe how the named entities identified by the Named Entity Recognition (NER) change.

Keep advancing!

```python
from nltk import pos_tag, word_tokenize, ne_chunk

# Given sentence
sample_sentence = "Apple Inc. planned to open a new store in San Francisco in 82% of time by 2020."

# Get tokens
tokens = word_tokenize(sample_sentence)

# Get POS tags
pos_tags = pos_tag(tokens)

# Perform NER
named_entities = ne_chunk(pos_tags)

# Print the identified named entities
print(named_entities)
```

Here's how the code would look after modifying the sentence, along with an explanation of the expected changes in Named Entity Recognition (NER) output:

---

### Updated Code:
```python
from nltk import pos_tag, word_tokenize, ne_chunk

# Updated sentence
sample_sentence = "Microsoft Corporation aims to reach 95% global coverage by 2025."

# Get tokens
tokens = word_tokenize(sample_sentence)

# Get POS tags
pos_tags = pos_tag(tokens)

# Perform NER
named_entities = ne_chunk(pos_tags)

# Print the identified named entities
print(named_entities)
```

---

### Expected Changes in NER Output:
Given the new sentence, NLTK's default chunker should behave roughly as follows:

1. **"Microsoft Corporation"**:
   - Will likely be classified as an **ORGANIZATION**, although the chunker may split the name or assign a different label, as it did with "Apple Inc." earlier in the lesson.

2. **"2025"**:
   - Is typically left unlabeled. The pre-trained `ne_chunk` model usually emits labels such as PERSON, ORGANIZATION, GPE, LOCATION, and FACILITY and does not tag dates, so "2025" keeps its POS tag (**CD**, cardinal number).

3. **"95%"**:
   - Is also not classified as a named entity; "95" keeps the **CD** tag and "%" is tagged as a separate token.

---

### Expected Output:
The exact result depends on the NLTK version and models installed, but it should look something like this:
```plaintext
(S
  (ORGANIZATION Microsoft/NNP Corporation/NNP)
  aims/VBZ
  to/TO
  reach/VB
  95/CD
  %/NN
  global/JJ
  coverage/NN
  by/IN
  2025/CD
  ./.)
```

- **(ORGANIZATION Microsoft/NNP Corporation/NNP)**: Recognized as an organization.
- **2025/CD**: Left as a plain cardinal number, not a named entity.
- Other words (e.g., "aims", "global", "coverage") retain their respective POS tags.

---

### Observations:
- Replacing the sentence changes the entities identified (e.g., *Microsoft Corporation* vs. *Apple Inc.*).
- Numeric expressions such as "95%" and "2025" are not labeled by the default chunker, even though the general definition of NER includes percentages and dates; recognizing them takes additional rules or a different model.
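
If the chunker leaves numeric expressions such as percentages or years untagged, a small rule-based pass can pick them up. The sketch below is only an illustration: the two regular expressions and the `PERCENT`/`DATE` labels are assumptions chosen for this sentence, and treating any four-digit number starting with 19 or 20 as a year will misfire on other text.

```python
import re

# Simple, illustrative patterns; real dates and percentages are more varied.
PATTERNS = {
    "PERCENT": re.compile(r"\b\d+(?:\.\d+)?%"),
    "DATE": re.compile(r"\b(?:19|20)\d{2}\b"),
}


def find_numeric_entities(text):
    """Return (matched text, label) pairs in order of appearance."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(), label))
    hits.sort()  # order by position in the text
    return [(matched, label) for _, matched, label in hits]


sentence = "Microsoft Corporation aims to reach 95% global coverage by 2025."
print(find_numeric_entities(sentence))
```

These pairs could be appended to whatever the chunker returns, giving a combined list of entities.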


## Implementing Tokenization and POS Tagging

Excellent work, Space Explorer!

Let's reinforce your learning by inserting the missing code. Fill in the appropriate functions to perform tokenization and POS tagging. Use nltk's powerful tools to achieve this. Happy learning!

```python
from nltk import ____, word_tokenize

# Given sentence
sample_sentence = "The quick brown fox jumps over the lazy dog."

# Get tokens
tokens = word_tokenize(sample_sentence)

# Get POS tags
pos_tags = ____(tokens)

# Print the POS tags
print(pos_tags)

```

Here's the completed code snippet with the appropriate functions filled in: 

```python
from nltk import pos_tag, word_tokenize

# Given sentence
sample_sentence = "The quick brown fox jumps over the lazy dog."

# Get tokens
tokens = word_tokenize(sample_sentence)

# Get POS tags
pos_tags = pos_tag(tokens)

# Print the POS tags
print(pos_tags)
```

### Explanation:
1. **`word_tokenize`**: This function splits the sentence into individual words (tokens).
2. **`pos_tag`**: This function assigns Part-of-Speech (POS) tags to the tokens, providing their grammatical roles.

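### Output:
When you run this code, the output should look roughly like this (individual tags can differ between tagger versions). Note that `word_tokenize` also returns the final period as its own token:
```
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
```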

This output represents the tokenized words along with their respective POS tags. For example:
- `'DT'` = Determiner
- `'JJ'` = Adjective
- `'NN'` = Noun
- `'VBZ'` = Verb, 3rd person singular present
- `'IN'` = Preposition or subordinating conjunction
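
A lookup table makes tagged output easier to read. The sketch below covers only the five tags listed above, which are a small subset of the Penn Treebank tagset, and uses a few `(token, tag)` pairs copied by hand from the output; `describe` is a hypothetical helper. NLTK also provides `nltk.help.upenn_tagset` for full tag descriptions, which needs an extra data package.

```python
# Descriptions for the tags listed above (a small subset of the Penn Treebank tagset).
TAG_DESCRIPTIONS = {
    'DT': 'determiner',
    'JJ': 'adjective',
    'NN': 'noun, singular or mass',
    'VBZ': 'verb, 3rd person singular present',
    'IN': 'preposition or subordinating conjunction',
}


def describe(tagged_tokens):
    """Pair each token with a readable description of its tag."""
    return [
        (token, TAG_DESCRIPTIONS.get(tag, f'unlisted tag {tag}'))
        for token, tag in tagged_tokens
    ]


# A few (token, tag) pairs copied by hand from the output above.
tagged = [('The', 'DT'), ('quick', 'JJ'), ('fox', 'NN'),
          ('jumps', 'VBZ'), ('over', 'IN')]

for token, description in describe(tagged):
    print(f'{token:<6} {description}')
```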

## Applying Named Entity Recognition to a Sentence

Great job, Space Voyager!

Now that you've created a function to perform Named Entity Recognition (NER), it's time to use it on some text. Fill in the blanks to apply your `apply_ner` function to the sentence: "CodeSignal is headquartered in San Francisco."

Print the recognized named entities from the sentence to see how it works.

Keep advancing!

```python
from nltk import ____, pos_tag, word_tokenize

def apply_ner(sentence):
    # Tokenize the sentence
    tokens = word_tokenize(sentence)

    # Apply POS tagging
    pos_tags = pos_tag(tokens)

    # Apply NER
    named_entities = ____(____)

    # Return the identified named entities
    return named_entities

# Apply the function to a specific sentence
named_entities_in_sentence = apply_ner("CodeSignal is headquartered in San Francisco.")
print(named_entities_in_sentence)

```

Here is the completed code, with the blanks filled in, to perform Named Entity Recognition on that sentence.

```python
from nltk import ne_chunk, pos_tag, word_tokenize

def apply_ner(sentence):
    # Tokenize the sentence
    tokens = word_tokenize(sentence)

    # Apply POS tagging
    pos_tags = pos_tag(tokens)

    # Apply NER
    named_entities = ne_chunk(pos_tags)

    # Return the identified named entities
    return named_entities

# Apply the function to a specific sentence
named_entities_in_sentence = apply_ner("CodeSignal is headquartered in San Francisco.")
print(named_entities_in_sentence)
```

The blanks are filled in with:
1. `ne_chunk` in the import statement - this is NLTK's function for Named Entity Recognition
2. `ne_chunk(pos_tags)` in the function - this applies NER to our POS-tagged tokens

When you run this code, it will output something like:
```
(S
  (ORGANIZATION CodeSignal/NNP)
  is/VBZ
  headquartered/VBN
  in/IN
  (GPE San/NNP Francisco/NNP)
  ./.)
```

The output shows that:
- "CodeSignal" is recognized as an ORGANIZATION
- "San Francisco" is recognized as a GPE (Geo-Political Entity)
- Other words are tagged with their parts of speech but aren't identified as named entities

## Implementing a Named Entity Extraction Function

Galactic Pioneer, you're doing great!

Your task is to complete the function `extract_named_entities`. This function should tokenize the given text, perform POS tagging, and carry out Named Entity Recognition (NER). The function should then return the named entities. Forge ahead!

```python
from nltk import pos_tag, word_tokenize, ne_chunk

def extract_named_entities(text):
    # TODO: Tokenize the text
    # TODO: Apply POS tagging
    # TODO: Perform NER and return the result
    pass

# Provided sentence
sentence = "Microsoft Corporation aims to reach 95% global coverage by 2025."

# Print the extracted named entities
print(extract_named_entities(sentence))
```


Here’s the completed function that extracts named entities using nltk:

```python
from nltk import pos_tag, word_tokenize, ne_chunk
from nltk.tree import Tree

def extract_named_entities(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Apply POS tagging
    pos_tags = pos_tag(tokens)
    
    # Perform Named Entity Recognition (NER)
    named_entities_tree = ne_chunk(pos_tags)
    
    # Extract named entities from the tree
    named_entities = []
    for subtree in named_entities_tree:
        if isinstance(subtree, Tree):  # Named entity detected
            entity_name = " ".join([token for token, pos in subtree.leaves()])
            entity_type = subtree.label()  # Get entity type (e.g., PERSON, ORGANIZATION)
            named_entities.append((entity_name, entity_type))
    
    return named_entities

# Provided sentence
sentence = "Microsoft Corporation aims to reach 95% global coverage by 2025."

# Print the extracted named entities
print(extract_named_entities(sentence))
```

This function:

1. Tokenizes the input text.
2. Applies POS (Part-of-Speech) tagging.
3. Uses `ne_chunk` to perform Named Entity Recognition (NER).
4. Extracts named entities along with their labels (e.g., ORGANIZATION, PERSON, etc.).
5. Returns a list of named entities and their types.

Try running it with different sentences to see how it detects named entities! 🚀
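
A list of `(name, type)` pairs is often easier to work with once it is grouped by type. The sketch below assumes input in that shape; the sample pairs are written by hand from outputs shown earlier in this lesson, and `group_by_label` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_label(entity_pairs):
    """Group (entity name, entity type) pairs into {type: [names]}."""
    grouped = defaultdict(list)
    for name, label in entity_pairs:
        grouped[label].append(name)
    return dict(grouped)


# Hand-written sample in the shape of a list of (name, type) pairs.
sample_entities = [
    ('Apple', 'PERSON'),
    ('Inc.', 'ORGANIZATION'),
    ('San Francisco', 'GPE'),
    ('CodeSignal', 'ORGANIZATION'),
]

print(group_by_label(sample_entities))
```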

## Applying NER to the First Newsgroups Document

Brilliant trajectory, Space Voyager!

You have been assigned a challenging task. Add the missing lines to tokenize the first document in `newsgroups_data`, and then perform POS tagging and Named Entity Recognition. Afterwards, print the named entities. Keep pushing boundaries!

```python
from sklearn.datasets import fetch_20newsgroups
# TODO: Import the necessary NLTK modules for tokenization, POS tagging, and NER


# Loading the data with metadata removed
newsgroups_data = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# TODO: Extract the first document from the dataset

# TODO: Tokenize the text

# TODO: Apply POS tagging

# TODO: Apply Named Entity Recognition

# TODO: Print the identified named entities

```

To complete the task of extracting named entities from the first document in the `newsgroups_data` dataset, we'll need to import the necessary modules from NLTK for tokenization, POS tagging, and Named Entity Recognition (NER). We'll then apply these tools to the text data retrieved from `fetch_20newsgroups`. Here's how you can do it:

```python
from sklearn.datasets import fetch_20newsgroups
from nltk import pos_tag, word_tokenize, ne_chunk
from nltk.tree import Tree

# Loading the data with metadata removed
newsgroups_data = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))

# Extract the first document from the dataset
first_document = newsgroups_data.data[0]

# Tokenize the text
tokens = word_tokenize(first_document)

# Apply POS tagging
tagged = pos_tag(tokens)

# Apply Named Entity Recognition
ner_tree = ne_chunk(tagged)

# Extract and print the identified named entities
named_entities = []
for subtree in ner_tree:
    if isinstance(subtree, Tree):  # Entity chunks are subtrees; plain tokens are (word, tag) tuples
        entity_label = subtree.label()  # Get the entity type (e.g., PERSON, ORGANIZATION)
        entity_name = " ".join(word for word, tag in subtree.leaves())  # Join the entity words
        named_entities.append((entity_name, entity_label))

# Print the named entities
for entity in named_entities:
    print(entity)
```

### Explanation:
- **Loading Data**: The `fetch_20newsgroups` function is used to load the newsgroups data. We specify `subset='train'` to get the training subset and `remove=('headers', 'footers', 'quotes')` to strip off additional metadata and text that could skew the analysis.
- **Extract First Document**: We take the first document from the dataset by accessing the zeroth index of the `data` attribute.
- **Tokenization**: We use `word_tokenize` to split the document into individual tokens (words and punctuation).
- **POS Tagging**: The `pos_tag` function assigns a part-of-speech tag to each token.
- **Named Entity Recognition**: The `ne_chunk` function processes the tagged tokens to identify named entities. It returns a tree structure where named entities are grouped into subtrees with labels indicating their types.
- **Extract Named Entities**: We iterate through the elements of the tree. If an element is a subtree, we identify it as a named entity, extract its label and the words forming the entity, and store these as tuples in the `named_entities` list.
- **Print Named Entities**: Finally, we print each named entity found in the document.

This script will output all named entities recognized in the first document of the `newsgroups_data` dataset, providing insights into the types of entities (like organizations, persons, locations) mentioned in the text.
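
On a longer document the same entity often appears several times, so counting the pairs gives a quick summary. The sketch below assumes a list of `(entity_name, entity_label)` tuples like `named_entities` above; the sample values are made up for illustration and are not output from the dataset.

```python
from collections import Counter


def most_common_entities(entity_pairs, n=3):
    """Return the n most frequent (name, label) pairs with their counts."""
    return Counter(entity_pairs).most_common(n)


# Made-up pairs in the same shape as `named_entities`.
sample_entities = [
    ('NASA', 'ORGANIZATION'),
    ('Houston', 'GPE'),
    ('NASA', 'ORGANIZATION'),
    ('Apollo', 'PERSON'),
    ('NASA', 'ORGANIZATION'),
    ('Houston', 'GPE'),
]

for (name, label), count in most_common_entities(sample_entities):
    print(f'{name} ({label}): {count}')
```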