
# Named Entity Recognition

## NER with NLTK

In [100]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [101]:
# Import necessary modules
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

In [102]:
article = '\ufeffThe taxi-hailing company Uber brings into very sharp focus the question of whether corporations can be said to have a moral character. If any human being were to behave with the single-minded and ruthless greed of the company, we would consider them sociopathic. Uber wanted to know as much as possible about the people who use its service, and those who don’t. It has an arrangement with unroll.me, a company which offered a free service for unsubscribing from junk mail, to buy the contacts unroll.me customers had had with rival taxi companies. Even if their email was notionally anonymised, this use of it was not something the users had bargained for. Beyond that, it keeps track of the phones that have been used to summon its services even after the original owner has sold them, attempting this with Apple’s phones even thought it is forbidden by the company.\r\n\r\n\r\nUber has also tweaked its software so that regulatory agencies that the company regarded as hostile would, when they tried to hire a driver, be given false reports about the location of its cars. Uber management booked and then cancelled rides with a rival taxi-hailing company which took their vehicles out of circulation. Uber deny this was the intention. The punishment for this behaviour was negligible. Uber promised not to use this “greyball” software against law enforcement – one wonders what would happen to someone carrying a knife who promised never to stab a policeman with it. Travis Kalanick of Uber got a personal dressing down from Tim Cook, who runs Apple, but the company did not prohibit the use of the app. Too much money was at stake for that.\r\n\r\n\r\nMillions of people around the world value the cheapness and convenience of Uber’s rides too much to care about the lack of drivers’ rights or pay. Many of the users themselves are not much richer than the drivers. 
The “sharing economy” encourages the insecure and exploited to exploit others equally insecure to the profit of a tiny clique of billionaires. Silicon Valley’s culture seems hostile to humane and democratic values. The outgoing CEO of Yahoo, Marissa Mayer, who is widely judged to have been a failure, is likely to get a $186m payout. This may not be a cause for panic, any more than the previous hero worship should have been a cause for euphoria. Yet there’s an urgent political task to tame these companies, to ensure they are punished when they break the law, that they pay their taxes fairly and that they behave responsibly.'

In [103]:
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [104]:
# Tokenize the article into sentences: sentences
sentences = nltk.sent_tokenize(article)

# Tokenize each sentence into words: token_sentences
token_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Tag each tokenized sentence into parts of speech: pos_sentences
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences] 

# Create the named entity chunks: chunked_sentences
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=True)

# Test for stems of the tree with 'NE' tags
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label") and chunk.label() == "NE":
            print(chunk)

(NE Uber/NNP)
(NE Beyond/NN)
(NE Apple/NNP)
(NE Uber/NNP)
(NE Uber/NNP)
(NE Travis/NNP Kalanick/NNP)
(NE Tim/NNP Cook/NNP)
(NE Apple/NNP)
(NE Silicon/NNP Valley/NNP)
(NE CEO/NNP)
(NE Yahoo/NNP)
(NE Marissa/NNP Mayer/NNP)


In [105]:
from collections import defaultdict
from matplotlib import pyplot as plt

In [None]:
# Recreate the chunks with category labels (binary=False) and store them in a
# list: the iterator built in the earlier cell is exhausted once its loop has run
chunked_sentences = list(nltk.ne_chunk_sents(pos_sentences, binary=False))

# Create the defaultdict: ner_categories
ner_categories = defaultdict(int)

# Count the chunks in each named-entity category
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label'):
            ner_categories[chunk.label()] += 1

# Create a list from the dictionary keys for the chart labels: labels
labels = list(ner_categories.keys())

# Create a list of the values: values
values = [ner_categories[label] for label in labels]

# Create the pie chart
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=140)

# Display the chart
plt.show()
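
A note on reusing `chunked_sentences`: in the NLTK releases this was checked against, `nltk.ne_chunk_sents` returns a lazy iterator (worth verifying for your installed version), so it can be looped over only once. The plain-Python sketch below shows the same behaviour with a hypothetical stand-in generator, `fake_chunker`, which is not part of NLTK.

```python
def fake_chunker(sentences):
    """Hypothetical stand-in for a lazy chunker: yields one result per sentence."""
    for sent in sentences:
        yield sent.upper()


lazy = fake_chunker(["uber", "apple"])
first_pass = [item for item in lazy]   # consumes the generator
second_pass = [item for item in lazy]  # nothing is left to yield

# Materialising the results once makes them reusable
stored = list(fake_chunker(["uber", "apple"]))
reuse_a = [item for item in stored]
reuse_b = [item for item in stored]

print(first_pass, second_pass)
print(reuse_a == reuse_b)
```

Wrapping the call in `list(...)` trades a little memory for the ability to loop over the chunks as often as needed.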

## spaCy

### Comparing NLTK with spaCy NER

In [107]:
# Import spacy
import spacy

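# Instantiate the English model: nlp
# (older spaCy releases accepted spacy.load('en', tagger=False, parser=False, matcher=False);
# current releases expect a full model name and a list of components to disable.
# The model must be installed first: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser'])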

# Create a new document: doc
doc = nlp(article)

# Print all of the found entities and their labels
for ent in doc.ents:
    print(ent.label_, ent.text)

ORG Uber
PERSON Uber
ORG unroll.me
ORG Apple
PERSON Travis Kalanick
PERSON Uber
PERSON Tim Cook
ORG Apple
CARDINAL Millions
PERSON Uber
LOC Silicon Valley
ORG Yahoo
PERSON Marissa Mayer
MONEY $186m
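
To compare this output with the NLTK result, it can help to tally the labels. The sketch below is a minimal illustration that works on `(label, text)` pairs copied by hand from the printed output above; the names `spacy_entities`, `label_counts` and `ambiguous` are illustrative, and with a live `doc` one would build the pairs from `doc.ents` instead.

```python
from collections import Counter, defaultdict

# (label, text) pairs copied by hand from the printed output above
spacy_entities = [
    ("ORG", "Uber"), ("PERSON", "Uber"), ("ORG", "unroll.me"),
    ("ORG", "Apple"), ("PERSON", "Travis Kalanick"), ("PERSON", "Uber"),
    ("PERSON", "Tim Cook"), ("ORG", "Apple"), ("CARDINAL", "Millions"),
    ("PERSON", "Uber"), ("LOC", "Silicon Valley"), ("ORG", "Yahoo"),
    ("PERSON", "Marissa Mayer"), ("MONEY", "$186m"),
]

# Count how often each label occurs
label_counts = Counter(label for label, _ in spacy_entities)

# Collect the set of labels assigned to each entity text
labels_by_text = defaultdict(set)
for label, text in spacy_entities:
    labels_by_text[text].add(label)

# Texts that received more than one label are worth a manual look
ambiguous = {text: sorted(labels)
             for text, labels in labels_by_text.items() if len(labels) > 1}

for label, n in label_counts.most_common():
    print(label, n)
print(ambiguous)
```

In this run "Uber" is the only text tagged with two different labels (`ORG` and `PERSON`), which is a reminder that the model labels each mention from its local context.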


## Multilingual NER with polyglot

In [38]:
!pip install polyglot

Collecting polyglot
  Downloading https://files.pythonhosted.org/packages/e7/98/e24e2489114c5112b083714277204d92d372f5bbe00d5507acf40370edb9/polyglot-16.7.4.tar.gz (126kB)

In [46]:
!pip install pycld2

Collecting pycld2
  Downloading https://files.pythonhosted.org/packages/21/d2/8b0def84a53c88d0eb27c67b05269fbd16ad68df8c78849e7b5d65e6aec3/pycld2-0.41.tar.gz (41.4MB)
Building wheels for collected packages: pycld2
  Building wheel for pycld2 (setup.py) ... done
  Created wheel for pycld2: filename=pycld2-0.41-cp36-cp36m-linux_x86_64.whl size=9833516 sha256=2cf3029d7c50d4987e08c928025f79d9a8cd1bc0d21d630ece79b2e7d458ef38
  Stored in directory: /root/.cache/pip/wheels/c6/8f/e9/08a1a8932a490175bd140206cd86a3dbcfc70498100de11079
Successfully built pycld2
Installing collected packages: pycld2
Successfully installed pycld2-0.41


In [48]:
!pip install morfessor

Collecting morfessor
  Downloading https://files.pythonhosted.org/packages/39/e6/7afea30be2ee4d29ce9de0fa53acbb033163615f849515c0b1956ad074ee/Morfessor-2.0.6-py3-none-any.whl
Installing collected packages: morfessor
Successfully installed morfessor-2.0.6


In [51]:
!pip install pyicu

Collecting pyicu
  Downloading https://files.pythonhosted.org/packages/5a/99/c48c816095208bf3f4936ff67e571621fbddef461303a35a076f234e31f6/PyICU-2.5.tar.gz (225kB)

In [52]:
from icu import Locale
import pycld2 as cld2

In [108]:
from polyglot.text import Text

### French NER with polyglot

In [58]:
!polyglot download LANG:fr

[polyglot_data] Downloading collection 'LANG:fr'
[polyglot_data]    | 
[polyglot_data]    | Downloading package sgns2.fr to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package unipos.fr to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package ner2.fr to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package counts2.fr to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package transliteration2.fr to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package embeddings2.fr to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package uniemb.fr to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package pos2.fr to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package sentiment2.fr to
[polyglot_data]    |     /root/polyglot_data

In [109]:
article = "\ufeffédition abonné\r\n\r\n\r\nDans une tribune au « Monde », l’universitaire Charles Cuvelliez estime que le fantasme d’un remplacement de l’homme par l’algorithme et le robot repose sur un malentendu.\r\n\r\n\r\nLe Monde | 10.05.2017 à 06h44 • Mis à jour le 10.05.2017 à 09h47 | Par Charles Cuvelliez (Professeur à l’Ecole polytechnique de l'université libre de Bruxelles)\r\n\r\n\r\nTRIBUNE. L’usage morbide, par certains, de Facebook Live a amené son fondateur à annoncer précipitamment le recrutement de 3 000 modérateurs supplémentaires. Il est vrai que l’intelligence artificielle (IA) est bien en peine de reconnaître des contenus violents, surtout diffusés en direct.\r\n\r\n\r\nLe quotidien affreux de ces modérateurs, contraints de visionner des horreurs à longueur de journée, mériterait pourtant qu’on les remplace vite par des machines !\r\n\r\n\r\nL’IA ne peut pas tout, mais là où elle peut beaucoup, on la maudit, accusée de détruire nos emplois, de remplacer la convivialité humaine. Ce débat repose sur un malentendu.\r\n\r\n\r\nIl vient d’une définition de l’IA qui n’a, dans la réalité, jamais pu être mise en pratique : en 1955, elle était vue comme la création de programmes informatiques qui, quoi qu’on leur confie, le feraient un jour mieux que les humains. On pensait que toute caractéristique de l’intelligence humaine pourrait un jour être si précisément décrite qu’il suffirait d’une machine pour la simuler. Ce n’est pas vrai.\r\n\r\n\r\nAngoisses infondées\r\n\r\n\r\nComme le dit un récent Livre blanc sur la question (Pourquoi il ne faut pas avoir peur de l’Intelligence arti\xadficielle, Julien Maldonato, Deloitte, mars 2017), rien ne pourra remplacer un humain dans sa globalité.\r\n\r\n\r\nL’IA, c’est de l’apprentissage automatique doté d’un processus d’ajustement de modèles statistiques à des masses de données, explique l’auteur. 
Il s’agit d’un apprentissage sur des paramètres pour lesquels une vision humaine n’explique pas pourquoi ils marchent si bien dans un contexte donné.\r\n\r\n\r\nC’est aussi ce que dit le rapport de l’Office parlementaire d’évaluation des choix scientifiques et technologiques (« Pour une intelligence artificielle maîtrisée, utile et démystifiée », 29 mars 2017), pour qui ce côté « boîte noire » explique des angoisses infondées. Ethiquement, se fonder sur l’IA pour des tâches critiques sans bien comprendre le comment..."

In [110]:
# Create a new text object using Polyglot's Text class: txt
txt = Text(article)

# Print each of the entities found
for ent in txt.entities:
    print(ent)
    
# Print the type of ent
print(type(ent))

['Charles', 'Cuvelliez']
['Charles', 'Cuvelliez']
['Bruxelles']
['l’IA']
['Julien', 'Maldonato']
['Deloitte']
['Ethiquement']
['l’IA']
['.']
<class 'polyglot.text.Chunk'>


In [111]:
# Create the list of tuples: entities
entities = [(ent.tag, ' '.join(ent)) for ent in txt.entities]

# Print entities
print(entities)

[('I-PER', 'Charles Cuvelliez'), ('I-PER', 'Charles Cuvelliez'), ('I-ORG', 'Bruxelles'), ('I-PER', 'l’IA'), ('I-PER', 'Julien Maldonato'), ('I-ORG', 'Deloitte'), ('I-PER', 'Ethiquement'), ('I-LOC', 'l’IA'), ('I-PER', '.')]
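
The flat list of tuples is easier to inspect once grouped by entity type. The sketch below is a minimal illustration on data copied by hand from the output above; `fr_entities` and `grouped` are illustrative names, and the tag prefix is stripped on the assumption that tags always have the form `I-XXX`, as they do in this output.

```python
from collections import defaultdict

# (tag, text) tuples copied by hand from the printed output above
fr_entities = [
    ('I-PER', 'Charles Cuvelliez'), ('I-PER', 'Charles Cuvelliez'),
    ('I-ORG', 'Bruxelles'), ('I-PER', 'l’IA'),
    ('I-PER', 'Julien Maldonato'), ('I-ORG', 'Deloitte'),
    ('I-PER', 'Ethiquement'), ('I-LOC', 'l’IA'), ('I-PER', '.'),
]

# Group unique entity texts under their type, keeping first-seen order
grouped = defaultdict(list)
for tag, text in fr_entities:
    kind = tag.split('-', 1)[-1]  # 'I-PER' -> 'PER'
    if text not in grouped[kind]:
        grouped[kind].append(text)

for kind, texts in grouped.items():
    print(kind, texts)
```

Grouping makes the questionable hits stand out: `l’IA` appears under both `PER` and `LOC`, and `Ethiquement` and `.` are listed as persons.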


### All languages from polyglot

In [68]:
from polyglot.downloader import downloader
downloader.download("embeddings2.en")

[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /root/polyglot_data...


True

In [69]:
downloader.download("TASK:transliteration2", quiet=True)

[polyglot_data] Error downloading 'transliteration2.pl' from <http://p
[polyglot_data]     olyglot.cs.stonybrook.edu/~polyglot/transliteratio
[polyglot_data]     n2/pl/transliteration.pl.tar.bz2>:   HTTP Error
[polyglot_data]     403: Forbidden


False

In [70]:
downloader.list(show_packages=False)

Using default data directory (/root/polyglot_data)
 Data server index for <http://polyglot.cs.stonybrook.edu/~polyglot/>
Collections:
  [P] LANG:af............. Afrikaans            packages and models
  [ ] LANG:als............ Alemannic            packages and models
  [P] LANG:am............. Amharic              packages and models
  [ ] LANG:an............. Aragonese            packages and models
  [P] LANG:ar............. Arabic               packages and models
  [ ] LANG:arz............ Egyptian Arabic      packages and models
  [ ] LANG:as............. Assamese             packages and models
  [ ] LANG:ast............ Asturian             packages and models
  [P] LANG:az............. Azerbaijani          packages and models
  [ ] LANG:ba............. Bashkir              packages and models
  [ ] LANG:bar............ Bavarian             packages and models
  [P] LANG:be............. Belarusian           packages and models
  [P] LANG:bg............. Bulgarian            pa

### Spanish NER with polyglot

In [71]:
!polyglot download LANG:es

[polyglot_data] Downloading collection 'LANG:es'
[polyglot_data]    | 
[polyglot_data]    | Downloading package sgns2.es to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package unipos.es to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package ner2.es to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package counts2.es to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package transliteration2.es to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    |   Package transliteration2.es is already up-to-
[polyglot_data]    |       date!
[polyglot_data]    | Downloading package embeddings2.es to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package uniemb.es to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package pos2.es to
[polyglot_data]    |     /root/polyglot_data...


In [119]:
article = "Lina del Castillo es profesora en el Instituto de Estudios Latinoamericanos Teresa Lozano Long (LLILAS) y el Departamento de Historia de la Universidad de Texas en Austin. Ella será la moderadora del panel “Los Mundos Políticos de Gabriel García Márquez” este viernes, Oct. 30, en el simposio Gabriel García Márquez: Vida y Legado.LIna del Castillo Actualmente, sus investigaciones abarcan la intersección de cartografía, disputas a las demandas de tierra y recursos, y la formación del n...el tren de medianoche que lleva a miles y miles de cadáveres uno encima del otro como tantos racimos del banano que acabarán tirados al mar. Ningún recuento periodístico podría provocar nuestra imaginación y nuestra memoria como este relato de García Márquez. Contenido Relacionado Lea más artículos sobre el archivo de Gabriel García Márquez Reciba mensualmente las últimas noticias e información del Harry Ransom Center con eNews, nuestro correo electrónico mensual. ¡Suscríbase hoy!"

In [122]:
# Create a new text object using Polyglot's Text class: txt
txt = Text(article)

# Print each of the entities found
for ent in txt.entities:
    print(ent)
    
# Print the type of ent
print(type(ent))

['Lina']
['Castillo']
['Teresa', 'Lozano', 'Long']
['Universidad', 'de', 'Texas']
['Austin']
['Austin', '.']
['Austin', '.', 'Ella']
['Gabriel', 'García', 'Márquez']
['Gabriel', 'García', 'Márquez']
['Legado.LIna']
['Castillo']
['García', 'Márquez']
['Gabriel', 'García', 'Márquez']
['Harry', 'Ransom']
['Harry', 'Ransom', 'Center']
<class 'polyglot.text.Chunk'>


In [123]:
# Create the list of tuples: entities
entities = [(ent.tag, ' '.join(ent)) for ent in txt.entities]

# Print entities
print(entities)

[('I-PER', 'Lina'), ('I-PER', 'Castillo'), ('I-PER', 'Teresa Lozano Long'), ('I-ORG', 'Universidad de Texas'), ('I-PER', 'Austin'), ('I-LOC', 'Austin .'), ('I-PER', 'Austin . Ella'), ('I-PER', 'Gabriel García Márquez'), ('I-PER', 'Gabriel García Márquez'), ('I-PER', 'Legado.LIna'), ('I-LOC', 'Castillo'), ('I-PER', 'García Márquez'), ('I-PER', 'Gabriel García Márquez'), ('I-PER', 'Harry Ransom'), ('I-ORG', 'Harry Ransom Center')]


In [124]:
# Create a new text object using Polyglot's Text class: txt
txt = Text(article)

# Initialize the count variable: count
count = 0

# Iterate over all the entities
for ent in txt.entities:
    # Check whether the entity contains 'Márquez' or 'Gabo'
    if "Márquez" in ent or "Gabo" in ent:
        # Increment count
        count += 1

# Print count
print(count)

# Calculate the percentage of entities that refer to "Gabo": percentage
percentage = count / len(txt.entities)
print(percentage)

4
0.26666666666666666
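
The test `"Márquez" in ent` matches whole tokens, since each chunk behaves like a list of words (as the printed chunks suggest). The sketch below reproduces the count without polyglot, on token lists copied by hand from the output above; `share_matching` and `es_entities` are hypothetical names introduced for illustration.

```python
def share_matching(entities, keywords):
    """Return (hits, share) for entities containing any keyword as a whole token."""
    if not entities:
        return 0, 0.0
    hits = sum(1 for ent in entities if any(k in ent for k in keywords))
    return hits, hits / len(entities)


# Token lists copied by hand from the printed entities above
es_entities = [
    ['Lina'], ['Castillo'], ['Teresa', 'Lozano', 'Long'],
    ['Universidad', 'de', 'Texas'], ['Austin'], ['Austin', '.'],
    ['Austin', '.', 'Ella'], ['Gabriel', 'García', 'Márquez'],
    ['Gabriel', 'García', 'Márquez'], ['Legado.LIna'], ['Castillo'],
    ['García', 'Márquez'], ['Gabriel', 'García', 'Márquez'],
    ['Harry', 'Ransom'], ['Harry', 'Ransom', 'Center'],
]

hits, share = share_matching(es_entities, ["Márquez", "Gabo"])
print(hits)
print(share)
```

Because the match is per token, a keyword that only occurs inside a longer token (for example as part of `Legado.LIna`) would not be counted.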


### Hindi NER with polyglot

In [86]:
!polyglot download LANG:hi

[polyglot_data] Downloading collection 'LANG:hi'
[polyglot_data]    | 
[polyglot_data]    | Downloading package sgns2.hi to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package unipos.hi to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package ner2.hi to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package counts2.hi to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package transliteration2.hi to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    |   Package transliteration2.hi is already up-to-
[polyglot_data]    |       date!
[polyglot_data]    | Downloading package embeddings2.hi to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package uniemb.hi to
[polyglot_data]    |     /root/polyglot_data...
[polyglot_data]    | Downloading package sentiment2.hi to
[polyglot_data]    |     /root/polyglot_da

In [160]:
article = "मेरा नाम तनय मुख़र्जी है | मैं इंदौर का रहने वाला हूँ | मैं रिलायंस कंपनी के लिए काम करता हूँ | भारत सरकार के तत्कालीन प्रधान मंत्री नरेंद्र मोदी है | भारत के राष्ट्र पिता का नाम महात्मा गाँधी है |"

In [161]:
# Create a new text object using Polyglot's Text class: txt
txt = Text(article)

# Print each of the entities found
for ent in txt.entities:
    print(ent)
    
# Print the type of ent
print(type(ent))

['इंदौर']
['रिलायंस']
['भारत', 'सरकार']
['नरेंद्र', 'मोदी']
['भारत']
['महात्मा', 'गाँधी']
<class 'polyglot.text.Chunk'>


In [162]:
# Create the list of tuples: entities
entities = [(ent.tag, ' '.join(ent)) for ent in txt.entities]

# Print entities
print(entities)

[('I-LOC', 'इंदौर'), ('I-ORG', 'रिलायंस'), ('I-ORG', 'भारत सरकार'), ('I-PER', 'नरेंद्र मोदी'), ('I-LOC', 'भारत'), ('I-PER', 'महात्मा गाँधी')]
