<h3>I] Describe Named Entity Recognition.</h3>

Named Entity Recognition (NER) is a fundamental task in natural language processing (NLP) that identifies named entities in a text and assigns each one to a predefined category, such as person, organization, location, date, or quantity. A typical pipeline tokenizes the input text, may add features such as part-of-speech tags, and then labels each token or span of tokens with an entity category. The labeling is usually done by a model trained on labeled datasets, ranging from statistical models to deep learning architectures. NER supports NLP applications such as information retrieval, question answering, and sentiment analysis because it extracts structured information from unstructured text.
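
To make the idea of "identify a span, then assign a category" concrete, here is a minimal sketch that uses a hand-written lookup table and a date pattern instead of a trained model. It is a toy stand-in, not how spaCy or any statistical NER system works internally; `GAZETTEER`, `DATE_PATTERN`, and `toy_ner` are hypothetical names introduced only for this illustration.

```python
import re

# Hypothetical toy gazetteer; a real NER model learns categories from labeled data.
GAZETTEER = {
    "Mexico": "GPE",
    "Canada": "GPE",
    "United States": "GPE",
}

# Simple pattern for dates written like "April 8, 2024".
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December) \d{1,2}, \d{4}\b"
)


def toy_ner(text):
    """Return (start, entity_text, label) tuples sorted by position in the text."""
    found = []
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "DATE"))
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), label))
    return sorted(found)


sentence = "The eclipse of April 8, 2024 was visible from Mexico to Canada."
for start, entity, label in toy_ner(sentence):
    print(f"{entity!r} at {start}: {label}")
```

A lookup table like this cannot handle names it has never seen or words whose category depends on context, which is the gap that trained models are meant to fill.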

<h3>II] Write Python code to recognize named entities in a document. The input should be a text file containing 200 to 300 words. The output should be written to another file.</h3>

In [1]:
import spacy
from spacy import displacy
import re

In [2]:
NER = spacy.load("en_core_web_sm")

In [3]:
# Read the input text; the with-block closes the file automatically
with open("input.txt", "r") as file:
    raw_text = file.read()

In [4]:
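# Iterating over a string yields characters, not words, so this is a character count.
# A whitespace-based word count would be len(raw_text.split()).
count = len(raw_text)
print(f"Number of characters {count}")

Number of characters 1531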
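
Because a Python string is a sequence of characters, the figure of 1531 above is a character count. To check the 200 to 300 word requirement, a whitespace split is one simple option. A minimal sketch, where `count_words` and the sample string are illustrative and not part of the original notebook:

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens in the text."""
    return len(text.split())


sample = "The solar eclipse of April 8, 2024"
print(f"Characters: {len(sample)}")
print(f"Words: {count_words(sample)}")
```

Splitting on whitespace treats "8," as one word, punctuation included, which is usually close enough for a length check like this one.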


In [5]:
print(raw_text)

The solar eclipse of April 8, 2024, also known as the Great North American eclipse, was a total solar eclipse visible across a band covering parts of North America, from Mexico to Canada and crossing the contiguous United States. A solar eclipse occurs when the Moon passes between Earth and the Sun, thereby obscuring the Sun. A total solar eclipse occurs when the Moon's apparent diameter is larger than the Sun's, blocking all direct sunlight. Totality occurs only in a limited path across Earth's surface, with the partial solar eclipse visible over a larger surrounding region.
The Moon's apparent diameter was 5.5 percent larger than average. With a magnitude of 1.0566, the eclipse's longest duration of totality was 4 minutes and 28.13 seconds just 4 mi (6 km) north of the Mexican town of Nazas, Durango.
This eclipse was the first total solar eclipse visible from Canada since February 26, 1979;[1][2] the first over Mexico since July 11, 1991;[3] and the first over the United States since

In [6]:
# Cleaning: remove bracketed citation markers such as [1]
cleaned_text = re.sub(r'\[\d+\]', '', raw_text)

print(cleaned_text)

The solar eclipse of April 8, 2024, also known as the Great North American eclipse, was a total solar eclipse visible across a band covering parts of North America, from Mexico to Canada and crossing the contiguous United States. A solar eclipse occurs when the Moon passes between Earth and the Sun, thereby obscuring the Sun. A total solar eclipse occurs when the Moon's apparent diameter is larger than the Sun's, blocking all direct sunlight. Totality occurs only in a limited path across Earth's surface, with the partial solar eclipse visible over a larger surrounding region.
The Moon's apparent diameter was 5.5 percent larger than average. With a magnitude of 1.0566, the eclipse's longest duration of totality was 4 minutes and 28.13 seconds just 4 mi (6 km) north of the Mexican town of Nazas, Durango.
This eclipse was the first total solar eclipse visible from Canada since February 26, 1979; the first over Mexico since July 11, 1991; and the first over the United States since August 2
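
The pattern `\[\d+\]` matches a literal `[`, one or more digits, and a literal `]`, which is why the markers `[1][2]` and `[3]` are gone from the printed text while everything else is unchanged. A small sketch on a fragment of the input, where `strip_citations` is a hypothetical helper name used only here:

```python
import re


def strip_citations(text: str) -> str:
    """Remove bracketed numeric citation markers such as [1] or [23]."""
    return re.sub(r"\[\d+\]", "", text)


sample = (
    "visible from Canada since February 26, 1979;[1][2] "
    "the first over Mexico since July 11, 1991;[3]"
)
print(strip_citations(sample))
```

Brackets that contain anything other than digits, such as `[a]`, are left in place by this pattern.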

In [7]:
text = NER(cleaned_text)

In [8]:
# Collect the distinct entity labels found in the text
all_labels = {ent.label_ for ent in text.ents}

In [9]:
# Write an explanation of each label, followed by a blank line.
# The file is opened once; sorting makes the label order reproducible.
with open("output.txt", "w") as f:
    f.write("Label explanations: \n")
    for label in sorted(all_labels):
        f.write(f"{label} : {spacy.explain(label)}\n")
        f.write("\n")

In [10]:
# Append each entity and its label, opening the file once
with open("output.txt", "a") as f:
    f.write("\nThe words and labels are \n")
    for ent in text.ents:
        f.write(f"{ent.text} {ent.label_}\n")
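
An alternative layout for the output file groups the entities under their label, which can be easier to scan than a flat list. The following is a minimal sketch under stated assumptions: `write_entities`, the demo file name, and the hand-written sample pairs are hypothetical. In the notebook, the pairs could be built as `[(ent.text, ent.label_) for ent in text.ents]`.

```python
from collections import defaultdict
from pathlib import Path
import tempfile


def write_entities(path, entities):
    """Write (text, label) pairs to path, grouped by label, opening the file once."""
    by_label = defaultdict(list)
    for ent_text, label in entities:
        by_label[label].append(ent_text)
    with open(path, "w", encoding="utf-8") as f:
        for label in sorted(by_label):
            f.write(f"{label}\n")
            for ent_text in by_label[label]:
                f.write(f"  {ent_text}\n")


# Hand-written pairs for illustration only, not actual model output.
sample_entities = [("April 8, 2024", "DATE"), ("Mexico", "GPE"), ("Canada", "GPE")]
out_path = Path(tempfile.gettempdir()) / "ner_output_demo.txt"
write_entities(out_path, sample_entities)
print(out_path.read_text(encoding="utf-8"))
```

Taking plain tuples rather than spaCy objects keeps the writer independent of the model, so it can be checked without loading `en_core_web_sm`.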

In [11]:
# Highlight the recognized entities inline in the notebook
displacy.render(text, style="ent", jupyter=True)