<h2>Funnel for NLP</h2>

  1. Tokenization
  2. Stemming
  3. POS tagging
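
The three stages above can be sketched with the standard library alone. This is a toy illustration, not what NLTK does internally: `tokenize`, `stem`, `pos_tag_toy` and `TOY_LEXICON` are hypothetical names introduced here, the stemmer only strips a few suffixes, and the tagger is a hand-written lookup table rather than a trained model.

```python
import re

def tokenize(text):
    # Stage 1: split into word tokens and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def stem(token):
    # Stage 2: naive suffix stripping (a stand-in for a real stemmer).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Hypothetical hand-written lexicon; real taggers are learned from annotated data.
TOY_LEXICON = {"the": "DT", "dogs": "NNS", "barked": "VBD", "loudly": "RB", ".": "."}

def pos_tag_toy(tokens):
    # Stage 3: look each token up; unknown words default to NN (noun).
    return [(tok, TOY_LEXICON.get(tok.lower(), "NN")) for tok in tokens]

tokens = tokenize("The dogs barked loudly.")
stems = [stem(t.lower()) for t in tokens]
tagged = pos_tag_toy(tokens)

print(tokens)  # ['The', 'dogs', 'barked', 'loudly', '.']
print(stems)   # ['the', 'dog', 'bark', 'loudly', '.']
print(tagged)
```

Each stage consumes the output of the previous one, which is why the steps are described as a funnel.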

<br>

<h1>Named Entity Recognition (NER)</h1>

NER is a subdomain of NLP.
It is a part of information extraction (IE).
It assigns a tag (such as person, location, or organization) to each entity mentioned in a text.

For example, "**Toshith** will go to Gangtok and **he** will do something."<br>
Here, Toshith and he refer to the same entity. An NER system would typically tag Toshith as a person and Gangtok as a location.

<br>


Features of NER:

1. Word level features
2. List lookup features
3. Document and corpus features
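
As a rough sketch of what the three feature families might look like in code, the function below builds one feature dictionary per token. The names `GAZETTEER` and `word_features`, and the particular features chosen, are assumptions made for illustration; real systems use far richer feature sets.

```python
from collections import Counter

# Hypothetical place-name list used for the list lookup feature.
GAZETTEER = {"gangtok", "india", "delhi"}

def word_features(tokens, i, doc_counts):
    word = tokens[i]
    return {
        # 1. Word-level features: properties of the token's own characters.
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix3": word[-3:].lower(),
        # 2. List lookup feature: membership in a known-entity list.
        "in_gazetteer": word.lower() in GAZETTEER,
        # 3. Document feature: how often the token occurs in this document.
        "doc_frequency": doc_counts[word.lower()],
    }

tokens = ["Toshith", "will", "go", "to", "Gangtok", "and", "he", "will", "do", "something", "."]
doc_counts = Counter(t.lower() for t in tokens)

features = word_features(tokens, 4, doc_counts)  # features for "Gangtok"
print(features)
```

Dictionaries like this are the usual input to a supervised sequence classifier, which learns which feature combinations signal an entity.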


Techniques of NER:

1. Rule-based
2. Supervised
3. Semi-supervised
4. Unsupervised
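
To make the first technique concrete, here is a minimal rule-based tagger. It is only a sketch under stated assumptions: `PLACES` and `TITLES` are hypothetical hand-written lists, and the two rules (gazetteer match, and capitalized words following a title) are illustrative rather than taken from any particular system.

```python
PLACES = {"gangtok", "india"}        # hypothetical gazetteer of places
TITLES = {"mr", "mrs", "dr", "cm"}   # hypothetical title cues

def rule_based_ner(tokens):
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        low = tokens[i].lower().rstrip(".")
        if low in PLACES:
            # Rule 1: a gazetteer match is tagged as a geo-political entity.
            tags[i] = "B-GPE"
        elif low in TITLES:
            # Rule 2: capitalized tokens right after a title form a PERSON.
            j = i + 1
            while j < len(tokens) and tokens[j][:1].isupper():
                tags[j] = "B-PERSON" if j == i + 1 else "I-PERSON"
                j += 1
            i = j - 1  # skip past the tokens consumed by this rule
        i += 1
    return list(zip(tokens, tags))

result = rule_based_ner(["CM", "Arvind", "Kejriwal", "visited", "Gangtok", "."])
for token, tag in result:
    print(token, tag)
```

Rules like these are precise on the cases they cover but miss anything outside the lists, which is the usual motivation for the supervised and semi-supervised approaches.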

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm
import nltk
nltk.download("punkt")
nltk.download("stopwords")
nltk.download('averaged_perceptron_tagger')
nltk.download("treebank")
nltk.download("words")
nltk.download("maxent_ne_chunker")
nltk.download("conll2002")

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.tag import pos_tag
from nltk.chunk import conlltags2tree, tree2conlltags


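def preprocess(txt):
  tokens = word_tokenize(txt)
  # Caveat: dropping stop words before tagging removes context words
  # (e.g. "was", "in") that the POS tagger and NE chunker can rely on.
  stop_words = set(stopwords.words("english"))
  filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
  return filtered_tokens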

def extract_entities(txt):
  tokens = preprocess(txt)
  tagged_tokens = pos_tag(tokens)

  ne_tree = nltk.ne_chunk(tagged_tokens)
  iob_tags = tree2conlltags(ne_tree)   # IOB tags: Inside, Outside, Beginning
  return iob_tags


text = "Narendra Modi was born in India. He is the Prime Minister of India"
entities = extract_entities(text)
print("Named Entities")
for word, pos, entity_tag in entities:  # 'pos' avoids shadowing the imported pos_tag()
  if entity_tag != "O":  # "O" marks tokens outside any named entity
    print(f"Word: {word}, POS_Tag: {pos}, Entity_Tag: {entity_tag}")

Named Entities
Word: NarendraModi, POS_Tag: NNP, Entity_Tag: B-GPE
Word: India, POS_Tag: NNP, Entity_Tag: B-GPE
Word: India, POS_Tag: NNP, Entity_Tag: B-GPE
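
IOB tags label one token at a time, so a multi-word entity is spread over a `B-` tag followed by `I-` tags. The helper below is a minimal sketch of how such triples could be merged back into whole entities. `iob_to_spans` is a hypothetical name, and the sample triples are hand-written for illustration, not output from NLTK.

```python
def iob_to_spans(tagged):
    """Group (word, pos, iob_tag) triples into (entity_text, label) pairs."""
    spans, words, label = [], [], None
    for word, _pos, tag in tagged:
        # A B- tag, or an I- tag with a different label, starts a new entity.
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if tag == "O" or starts_new:
            if words:
                spans.append((" ".join(words), label))
            words, label = [], None
        if tag != "O":
            words.append(word)
            label = tag[2:]
    if words:  # flush an entity that runs to the end of the sentence
        spans.append((" ".join(words), label))
    return spans

# Hand-written example triples (not produced by a tagger).
sample = [
    ("Narendra", "NNP", "B-PERSON"),
    ("Modi", "NNP", "I-PERSON"),
    ("born", "VBN", "O"),
    ("India", "NNP", "B-GPE"),
]
merged = iob_to_spans(sample)
print(merged)  # [('Narendra Modi', 'PERSON'), ('India', 'GPE')]
```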


In [None]:
import pandas as pd
import spacy
import requests
from bs4 import BeautifulSoup

nlp = spacy.load("en_core_web_sm")
pd.set_option("display.max_rows", 200)

In [None]:
content = "CM Arvind Kejriwal slapped with a slipper amidst election rally. Judges in the Supreme Court died laughing."
doc = nlp(content)
#spacy.displacy.render(doc, style="ent")
for ent in doc.ents:
  print(ent.text, ent.start_char, ent.end_char, ent.label_)

CM Arvind Kejriwal 0 18 PERSON
the Supreme Court 75 92 ORG
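
The two numbers printed for each entity are character offsets into `content`, with the end offset exclusive. The sketch below uses only plain string slicing, with the offsets and labels copied from the printed output, to show how they can be used to recover or mask the entity text. The `ents` list and the `[LABEL]` masking format are assumptions made for this example.

```python
content = "CM Arvind Kejriwal slapped with a slipper amidst election rally. Judges in the Supreme Court died laughing."

# (start_char, end_char, label) triples copied from the printed output.
ents = [(0, 18, "PERSON"), (75, 92, "ORG")]

# Slicing with the offsets recovers the entity text.
recovered = [content[start:end] for start, end, _label in ents]
print(recovered)  # ['CM Arvind Kejriwal', 'the Supreme Court']

# Replace from right to left so earlier offsets stay valid.
redacted = content
for start, end, label in sorted(ents, reverse=True):
    redacted = redacted[:start] + f"[{label}]" + redacted[end:]
print(redacted)
```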


In [None]:
import nltk
from nltk.chunk import ne_chunk
from nltk.chunk.util import tree2conlltags
from nltk.corpus import conll2002
from sklearn.metrics import accuracy_score

def load_conll2002_data():
  train_sents = list(conll2002.iob_sents('esp.train'))
  test_sents = list(conll2002.iob_sents('esp.testb'))
  return train_sents, test_sents

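def evaluate_ner(train_sents, test_sents):
  # Rule-based chunker: train_sents is unused because nothing is learned.
  # The patterns use Penn Treebank-style POS tags, while the conll2002 'esp'
  # files carry Spanish POS tags, so few (if any) chunks are likely to match.
  chunking_rule = r'''
  NP: {<DT|JJ|NN.*>+}
  PP: {<IN><NP>}
  VP: {<VB.*><NP|PP|CLAUSE>+$}
  CLAUSE: {<NP><VP>}
  '''

  chunker = nltk.RegexpParser(chunking_rule)
  parsed_test_sents = [chunker.parse(sent) for sent in test_sents]

  predicted_labels = []
  true_labels = []
  for parsed_sent, test_sent in zip(parsed_test_sents, test_sents):
    iob_tags = tree2conlltags(parsed_sent)
    predicted_labels.extend(tag for word, pos, tag in iob_tags)
    true_labels.extend(tag for word, pos, tag in test_sent)

  # Token-level accuracy: tokens tagged 'O' in both sequences count as correct,
  # so the score is dominated by the majority 'O' class.
  accuracy = accuracy_score(true_labels, predicted_labels)
  return accuracy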

train, test = load_conll2002_data()
accuracy = evaluate_ner(train, test)
print(accuracy)

0.8800186288397726
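
Token-level accuracy can look high for NER simply because most tokens carry the `O` tag. The sketch below, on a small hand-made tag sequence (not CoNLL data), compares token accuracy with entity-level F1 for a predictor that outputs `O` everywhere. The helper names `spans` and `entity_f1` are hypothetical and introduced only for this illustration.

```python
def spans(tags):
    """Return the set of (start, end, label) entity spans in an IOB sequence."""
    found, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        if start is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != label):
            found.add((start, i, label))
            start, label = None, None
        if tag != "O" and start is None:
            start, label = i, tag[2:]
    return found

def entity_f1(true_tags, pred_tags):
    true_spans, pred_spans = spans(true_tags), spans(pred_tags)
    tp = len(true_spans & pred_spans)  # exact span-and-label matches only
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(true_spans) if true_spans else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hand-made example: 10 tokens, 2 entities, and an "always O" prediction.
true_tags = ["B-PER", "I-PER", "O", "O", "O", "O", "B-LOC", "O", "O", "O"]
pred_tags = ["O"] * len(true_tags)

token_accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
f1 = entity_f1(true_tags, pred_tags)
print(token_accuracy)  # 0.7
print(f1)              # 0.0
```

For this reason NER systems are usually compared on entity-level precision, recall and F1 rather than on token accuracy alone.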
