#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2020


# Polyglot

## Introduction

The amount of text on the internet is growing rapidly, and analysing it manually is impossible. This has led to a great deal of research in text analysis and Natural Language Processing (NLP), since automated analysis can surface useful insights at scale. Various libraries are available for this; NLTK and spaCy are the most widely used. However, their capabilities are limited when working with many languages. To address this limitation, Rami Al-Rfou built polyglot.

Polyglot is a natural language pipeline that supports massive multilingual applications. Its API is very similar to TextBlob's, so it is easy to learn one if you know the other. The features include:
    
- [Language Detection](#languageDetection)
- [Word Tokenization and Sentence Segmentation](#tokenization)
- [Word Embeddings](#wordEmbeddings)
- [Part of Speech (POS) Tagging](#posTagging)
- [Named Entity Recognition (NER)](#ner)
- [Morphological Analysis](#morphologicalAnalysis)
- [Transliteration](#transliteration)
- [Sentiment Analysis](#sentimentAnalysis)

This notebook explores these features with examples in multiple languages.


## Installation

Run the following commands to install polyglot and its dependencies.
Refer to the [polyglot installation guide](https://polyglot.readthedocs.io/en/latest/Installation.html) for more information.


```bash
pip install polyglot
pip install PyICU
pip install pycld2
pip install morfessor
```

## Getting Started

Before we get started, some models need to be downloaded for polyglot to function properly. Since these models are fairly large, they are distributed separately through a download manager.

There are 2 ways of downloading the models:
- **Interactive**

In [None]:
from polyglot.downloader import downloader
downloader.download("embeddings2.en")
downloader.download("pos2.en")
downloader.download("ner2.en")
downloader.download("transliteration2.gu")
downloader.download("morph2.en")
downloader.download("morph2.ar")
downloader.download("sentiment2.en")

- **Command line (bash)**

In [None]:
%%bash
polyglot download embeddings2.en pos2.en ner2.en transliteration2.gu morph2.en morph2.ar sentiment2.en

**Here are all the available and installed models and packages**

In [179]:
downloader.list(show_packages=False)

Using default data directory (/home/phoenix1712/polyglot_data)
 Data server index for <http://polyglot.cs.stonybrook.edu/~polyglot/>
Collections:
  [ ] LANG:af............. Afrikaans            packages and models
  [ ] LANG:als............ Alemannic            packages and models
  [ ] LANG:am............. Amharic              packages and models
  [ ] LANG:an............. Aragonese            packages and models
  [P] LANG:ar............. Arabic               packages and models
  [ ] LANG:arz............ Egyptian Arabic      packages and models
  [ ] LANG:as............. Assamese             packages and models
  [ ] LANG:ast............ Asturian             packages and models
  [ ] LANG:az............. Azerbaijani          packages and models
  [ ] LANG:ba............. Bashkir              packages and models
  [ ] LANG:bar............ Bavarian             packages and models
  [ ] LANG:be............. Belarusian           packages and models
  [ ] LANG:bg............. Bulgarian  

<a id='languageDetection'></a>
## Language Detection

Detecting the language of multilingual input is a prerequisite for proper tokenization and for any language-specific analysis that follows.

Polyglot can detect the languages present in a text. It can also detect multiple languages in the same text and assigns a confidence score to each detection.

Sometimes there is not enough text to detect a language reliably, for example when the input is a single word. In such cases the detector switches to a best-effort strategy, issues a warning, and sets the attribute `reliable` to `False`.
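
To make the ideas of per-language confidence and a reliability flag concrete, here is a deliberately simple toy sketch. It is not how polyglot detects languages: it only guesses the writing script from Unicode character names, and the function name `script_shares` and the `min_letters` threshold are assumptions made up for this illustration.

```python
import unicodedata
from collections import Counter


def script_shares(text, min_letters=20):
    """Toy stand-in for a detector: percentage of letters per Unicode script."""
    # Use the first word of each character's Unicode name as a rough script label,
    # e.g. "LATIN SMALL LETTER A" -> "LATIN", "CJK UNIFIED IDEOGRAPH-4E2D" -> "CJK".
    scripts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    total = sum(scripts.values())
    shares = [(s, round(100 * n / total, 1)) for s, n in scripts.most_common()] if total else []
    # With very few letters any guess is best effort only (hypothetical threshold).
    reliable = total >= min_letters
    return shares, reliable


shares, reliable = script_shares("China (中国) is a sovereign state located in East Asia.")
print(shares, reliable)

short_shares, short_reliable = script_shares("Bonjour")
print(short_shares, short_reliable)
```

The mixed sentence yields one share per script, while the single word is flagged as unreliable because it contains too few letters to support a confident guess.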

In [24]:
# Single Language

from polyglot.detect import Detector

arabic_text = u"""
أفاد مصدر امني في قيادة عمليات صلاح الدين في العراق بأن " القوات الامنية تتوقف لليوم
الثالث على التوالي عن التقدم الى داخل مدينة تكريت بسبب
انتشار قناصي التنظيم الذي يطلق على نفسه اسم "الدولة الاسلامية" والعبوات الناسفة
والمنازل المفخخة والانتحاريين، فضلا عن ان القوات الامنية تنتظر وصول تعزيزات اضافية ".
"""
detector = Detector(arabic_text)
print(detector.language)

french_text = "Bonjour, Mesdames."
detector = Detector(french_text)
print(detector.language)

name: Arabic      code: ar       confidence:  99.0 read bytes:   907
name: French      code: fr       confidence:  94.0 read bytes:  1204


In [19]:
# Multiple Languages

from polyglot.detect import Detector

mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state located in East Asia.
"""

for language in Detector(mixed_text).languages:
    print(language)

name: English     code: en       confidence:  87.0 read bytes:  1154
name: Chinese     code: zh_Hant  confidence:   5.0 read bytes:  1755
name: un          code: un       confidence:   0.0 read bytes:     0


<a id='tokenization'></a>
## Word Tokenization and Sentence Segmentation

When working with text data, it is important to recognize the boundaries between words and sentences. Word tokenization and sentence segmentation achieve this. There are 2 ways to proceed: find the sentence boundaries first and then break each sentence into words, or identify the words first and then group them into sentences.
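
The first strategy (sentences first, then words) can be sketched with two naive regular expressions. This is only an illustration under simplifying assumptions, not polyglot's tokenizer: the helper names `split_sentences` and `split_words` are made up here, and the rules rely on whitespace and Western punctuation, so they would not work for a language such as Mandarin.

```python
import re


def split_sentences(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    # A token is a run of word characters/apostrophes, or a single punctuation mark.
    return re.findall(r"[\w']+|[^\w\s]", sentence)


text = "Australia posted a record total. David Warner hit 178 off 133 balls!"
sentences = split_sentences(text)
tokens = [split_words(s) for s in sentences]

for sentence, words in zip(sentences, tokens):
    print(sentence, "->", words)
```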

Here are some examples in English and Mandarin.

In [38]:
# English
from polyglot.text import Text

text = Text("Australia posted a World Cup record total of 417 - 6 as they beat Afghanistan by 275 runs ."
"David Warner hit 178 off 133 balls , Steve Smith scored 95 while Glenn Maxwell struck 88 in 39 deliveries in the Pool A encounter in Perth ."
"Afghanistan were then dismissed for 142 , with Mitchell Johnson and Mitchell Starc taking six wickets between them ."
"Australia's score surpassed the 413 - 5 India made against Bermuda in 2007 ."
"It continues the pattern of bat dominating ball in this tournament as the third 400 plus score achieved in the pool stages , following South Africa's 408 - 5 and 411 - 4 against West Indies and Ireland respectively ."
"The winning margin beats the 257 - run amount by which India beat Bermuda in Port of Spain in 2007 , which was equalled five days ago by South Africa in their victory over West Indies in Sydney .")

In [39]:
# word tokenization
text.words

WordList(['Australia', 'posted', 'a', 'World', 'Cup', 'record', 'total', 'of', '417', '-', '6', 'as', 'they', 'beat', 'Afghanistan', 'by', '275', 'runs', '.', 'David', 'Warner', 'hit', '178', 'off', '133', 'balls', ',', 'Steve', 'Smith', 'scored', '95', 'while', 'Glenn', 'Maxwell', 'struck', '88', 'in', '39', 'deliveries', 'in', 'the', 'Pool', 'A', 'encounter', 'in', 'Perth', '.', 'Afghanistan', 'were', 'then', 'dismissed', 'for', '142', ',', 'with', 'Mitchell', 'Johnson', 'and', 'Mitchell', 'Starc', 'taking', 'six', 'wickets', 'between', 'them', '.', "Australia's", 'score', 'surpassed', 'the', '413', '-', '5', 'India', 'made', 'against', 'Bermuda', 'in', '2007', '.', 'It', 'continues', 'the', 'pattern', 'of', 'bat', 'dominating', 'ball', 'in', 'this', 'tournament', 'as', 'the', 'third', '400', 'plus', 'score', 'achieved', 'in', 'the', 'pool', 'stages', ',', 'following', 'South', "Africa's", '408', '-', '5', 'and', '411', '-', '4', 'against', 'West', 'Indies', 'and', 'Ireland', 'respec

In [40]:
# sentence segmentation
text.sentences

[Sentence("Australia posted a World Cup record total of 417 - 6 as they beat Afghanistan by 275 runs ."),
 Sentence("David Warner hit 178 off 133 balls , Steve Smith scored 95 while Glenn Maxwell struck 88 in 39 deliveries in the Pool A encounter in Perth ."),
 Sentence("Afghanistan were then dismissed for 142 , with Mitchell Johnson and Mitchell Starc taking six wickets between them ."),
 Sentence("Australia's score surpassed the 413 - 5 India made against Bermuda in 2007 ."),
 Sentence("It continues the pattern of bat dominating ball in this tournament as the third 400 plus score achieved in the pool stages , following South Africa's 408 - 5 and 411 - 4 against West Indies and Ireland respectively ."),
 Sentence("The winning margin beats the 257 - run amount by which India beat Bermuda in Port of Spain in 2007 , which was equalled five days ago by South Africa in their victory over West Indies in Sydney .")]

In [41]:
# words present in first sentence
text.sentences[0].words

WordList(['Australia', 'posted', 'a', 'World', 'Cup', 'record', 'total', 'of', '417', '-', '6', 'as', 'they', 'beat', 'Afghanistan', 'by', '275', 'runs', '.'])

In [46]:
# Mandarin

text = u"""
两个月前遭受恐怖袭击的法国巴黎的犹太超市在装修之后周日重新开放，法国内政部长以及超市的管理者都表示，这显示了生命力要比野蛮行为更强大。
该超市1月9日遭受枪手袭击，导致4人死亡，据悉这起事件与法国《查理周刊》杂志社恐怖袭击案有关。
"""
text = Text(text)
print(text.language)

name: Chinese     code: zh       confidence:  99.0 read bytes:  1920


In [47]:
# word tokenization
text.words

WordList(['两', '个', '月', '前', '遭受', '恐怖', '袭击', '的', '法国', '巴黎', '的', '犹太', '超市', '在', '装修', '之后', '周日', '重新', '开放', '，', '法国', '内政', '部长', '以及', '超市', '的', '管理者', '都', '表示', '，', '这', '显示', '了', '生命力', '要', '比', '野蛮', '行为', '更', '强大', '。', '该', '超市', '1', '月', '9', '日', '遭受', '枪手', '袭击', '，', '导致', '4', '人', '死亡', '，', '据悉', '这', '起', '事件', '与', '法国', '《', '查理', '周刊', '》', '杂志', '社', '恐怖', '袭击', '案', '有关', '。'])

In [48]:
# sentence segmentation
text.sentences

[Sentence("两个月前遭受恐怖袭击的法国巴黎的犹太超市在装修之后周日重新开放，法国内政部长以及超市的管理者都表示，这显示了生命力要比野蛮行为更强大。"),
 Sentence("该超市1月9日遭受枪手袭击，导致4人死亡，据悉这起事件与法国《查理周刊》杂志社恐怖袭击案有关。")]

<a id='wordEmbeddings'></a>
## Word Embeddings

Intuitively, words with similar meanings should have representations that are close to each other. Plain one-hot encoding does not achieve this, and it is also extremely sparse. To resolve this, words are mapped to a dense vector space in which words with similar meanings lie close together; such representations are called word embeddings.
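
The contrast can be seen with a small sketch. The dense vectors below are made-up numbers chosen for illustration, not values from any trained model.

```python
import math


def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


vocab = ["king", "queen", "apple"]

# One-hot: one dimension per word, a single 1 at the word's own index.
one_hot = {
    w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, w in enumerate(vocab)
}

# Dense toy embeddings (hypothetical values).
dense = {"king": [0.9, 0.8], "queen": [0.85, 0.9], "apple": [-0.7, 0.1]}

print(cosine(one_hot["king"], one_hot["queen"]))  # distinct one-hot vectors are orthogonal
print(cosine(dense["king"], dense["queen"]))      # related words point in similar directions
print(cosine(dense["king"], dense["apple"]))      # unrelated words do not
```

With one-hot vectors every pair of distinct words is equally dissimilar, so no notion of relatedness can be read off the representation.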

The Embedding class in polyglot can read word embeddings from different sources:

- Gensim word2vec objects (`from_gensim` method)
- Word2vec binary/text models (`from_word2vec` method)
- GloVe models (`from_glove` method)
- polyglot pickle files (`load` method)

The word embeddings are not unit vectors; in fact, the more frequent a word is, the larger the norm of its vector. However, many machine learning tasks, such as classification and training RNNs, often expect normalised weights. Polyglot provides an easy way to normalise the embeddings.

We can use the embedding model downloaded above for this example.
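
A minimal sketch of why normalisation matters, using two made-up vectors that point in the same direction but differ in norm (as a frequent and a rare word might). The helper names are assumptions for this illustration and are not part of polyglot.

```python
import math


def normalize(v):
    # Scale the vector to unit length (L2 norm of 1).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


frequent = [6.0, 8.0]  # norm 10, standing in for a frequent word
rare = [0.6, 0.8]      # norm 1, same direction

raw_distance = euclidean(frequent, rare)
unit_distance = euclidean(normalize(frequent), normalize(rare))

print(raw_distance)   # large, driven purely by the difference in norm
print(unit_distance)  # close to zero once both vectors have unit length
```

Before normalisation the distance mostly reflects how frequent the words are; after it, only the direction of the vectors matters.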

In [98]:
# load the previously downloaded embedding model
from polyglot.mapping import Embedding
embeddings = Embedding.load("/home/phoenix1712/polyglot_data/embeddings2/en/embeddings_pkl.tar.bz2")

In [99]:
# get neighbours of the word "Retrieval" in the embedding space
embeddings.nearest_neighbors("Retrieval")

['Visualization',
 'Coding',
 'Retention',
 'Manipulation',
 'Pricing',
 'Validation',
 'Estimation',
 'Projections',
 'Forecasting',
 'Sampling']

In [101]:
# list the distance of each nearest neighbour
neighbors = embeddings.nearest_neighbors("Retrieval")
for w,d in zip(neighbors, embeddings.distances("Retrieval", neighbors)):
    print("{:<8} - {:.4f}".format(w,d))

Visualization - 1.2541
Coding   - 1.2838
Retention - 1.3252
Manipulation - 1.3609
Pricing  - 1.3930
Validation - 1.4006
Estimation - 1.4009
Projections - 1.4028
Forecasting - 1.4135
Sampling - 1.4180


In [102]:
# normalise the embedding weights and list the distance of nearest neighbours
embeddings = embeddings.normalize_words()
neighbors = embeddings.nearest_neighbors("Retrieval")
for w,d in zip(neighbors, embeddings.distances("Retrieval", neighbors)):
    print("{:<8} - {:.4f}".format(w,d))

Visualization - 0.5469
Coding   - 0.5580
Retention - 0.5964
Manipulation - 0.6050
Estimation - 0.6137
Validation - 0.6140
Forecasting - 0.6177
Enhancement - 0.6191
Pricing  - 0.6227
Projections - 0.6248


### Vocabulary Expansion

The vocabulary of an embedding model does not contain every word. Polyglot provides some basic expansion tools to cope with this:
- **Case Expansion**:
Looks up a differently cased version of the word in the embedding vocabulary.
- **Digit Expansion**:
To reduce the size of the vocabulary while training the embeddings, special classes of words are grouped. One common case of such grouping is digits. Every digit in the training corpus gets replaced by the symbol #. For example, a number like 123.54 becomes ###.##. Therefore, querying the embedding for a new number like 670 fails unless its digits are mapped to # in the same way.
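
The two expansions can be sketched as a fallback lookup over a toy vocabulary. This is an illustration of the idea only; the function `lookup`, the vocabulary, and the order in which candidates are tried are assumptions, and polyglot's `CaseExpander` and `DigitExpander` may choose candidates differently.

```python
import re

# Toy vocabulary mapping words to row indices (hypothetical contents).
vocab = {"retrieval": 0, "Retrieval": 1, "###": 2, "##.##": 3}


def lookup(word, vocab):
    """Return the vocabulary key that can stand in for `word`, or None."""
    digits_masked = re.sub(r"\d", "#", word)  # digit expansion: 670 -> ###
    candidates = [
        word,
        digits_masked,
        word.lower(),   # case expansion: try other casings of the word
        word.upper(),
        word.title(),
    ]
    for candidate in candidates:
        if candidate in vocab:
            return candidate
    return None


print(lookup("RETRIEVAL", vocab))
print(lookup("670", vocab))
print(lookup("12.50", vocab))
print(lookup("polyglot", vocab))
```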

In [103]:
# "RETRIEVAL" is not present in the embeddings
"RETRIEVAL" in embeddings

False

In [104]:
# Case Expansion
from polyglot.mapping import CaseExpander
embeddings.apply_expansion(CaseExpander)
"RETRIEVAL" in embeddings

True

In [108]:
embeddings.nearest_neighbors("RETRIEVAL")

['validation',
 'utilization',
 'calibration',
 'synchronization',
 'visualization',
 'optimization',
 'usability',
 'stabilization',
 'reliability',
 'pricing']

In [105]:
# the number 670 is not in the embedding vocabulary because digits are replaced by # during training
"670" in embeddings

False

In [106]:
# Digit Expansion
from polyglot.mapping import DigitExpander
embeddings.apply_expansion(DigitExpander)
"670" in embeddings

True

In [107]:
embeddings.nearest_neighbors("670")

['##', '#', '3', '#####', '#,###', '##,###', '##EN##', '####', '###EN###', 'n']

<a id='posTagging'></a>
## Part of Speech Tagging

It is important to know which syntactic category each word in a sentence belongs to. The process of assigning such a category to each word is called Part of Speech (POS) tagging.

Polyglot recognizes 17 parts of speech; this set is called the universal part of speech tag set:

- **ADJ**: adjective
- **ADP**: adposition
- **ADV**: adverb
- **AUX**: auxiliary verb
- **CONJ**: coordinating conjunction
- **DET**: determiner
- **INTJ**: interjection
- **NOUN**: noun
- **NUM**: numeral
- **PART**: particle
- **PRON**: pronoun
- **PROPN**: proper noun
- **PUNCT**: punctuation
- **SCONJ**: subordinating conjunction
- **SYM**: symbol
- **VERB**: verb
- **X**: other

The models were trained on a combination of:

- Original CoNLL datasets after the tags were converted using the [universal POS tables](https://universaldependencies.org/docs/tagset-conversion/index.html).
- Universal Dependencies 1.0 corpora whenever they are available.

In [109]:
# Supported languages for POS tagging
from polyglot.downloader import downloader
print(downloader.supported_languages_table("pos2"))

  1. Italian                    2. French                     3. Spanish; Castilian       
  4. Bulgarian                  5. Slovene                    6. Irish                    
  7. Finnish                    8. Dutch                      9. Swedish                  
 10. Danish                    11. Portuguese                12. English                  
 13. German                    14. Indonesian                15. Czech                    
 16. Hungarian                


In [148]:
# Getting the tagging for a given sentence.
from polyglot.text import Text
blob = """Texas A&M ruins Auburn’s undefeated home record in College Basketball."""
text = Text(blob)
text.pos_tags

[('Texas', 'PROPN'),
 ('A', 'NOUN'),
 ('&', 'CONJ'),
 ('M', 'PROPN'),
 ('ruins', 'NOUN'),
 ('Auburn’s', 'NUM'),
 ('undefeated', 'ADJ'),
 ('home', 'NOUN'),
 ('record', 'NOUN'),
 ('in', 'ADP'),
 ('College', 'PROPN'),
 ('Basketball', 'PROPN'),
 ('.', 'PUNCT')]

<a id='ner'></a>
## Named Entity Recognition

Named Entity Recognition (NER) is the task of locating named entities in unstructured text and classifying them into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages. It can be thought of as a slot-filling problem that requires understanding of the text's syntax and structure.

Polyglot recognizes 3 categories of entities:

- Locations (Tag: I-LOC): cities, countries, regions, continents, neighborhoods, administrative divisions …
- Organizations (Tag: I-ORG): sports teams, newspapers, banks, universities, schools, non-profits, companies, …
- Persons (Tag: I-PER): politicians, scientists, artists, athletes …

In [114]:
# Languages supported by the NER in polyglot
from polyglot.downloader import downloader
print(downloader.supported_languages_table("ner2"))

  1. Italian                    2. Hindi                      3. French                   
  4. Spanish; Castilian         5. Vietnamese                 6. Arabic                   
  7. Bulgarian                  8. Norwegian                  9. Estonian                 
 10. Japanese                  11. Greek, Modern             12. Slovene                  
 13. Korean                    14. Serbian                   15. Finnish                  
 16. Catalan; Valencian        17. Croatian                  18. Dutch                    
 19. Swedish                   20. Tagalog                   21. Danish                   
 22. Latvian                   23. Ukrainian                 24. Romanian, Moldavian, ... 
 25. Persian                   26. Slovak                    27. Portuguese               
 28. English                   29. Malay                     30. Polish                   
 31. German                    32. Indonesian                33. Chinese                  

In [149]:
from polyglot.text import Text
blob = """
Joe Biden is leading the race for the Democratic presidential nomination in one night and supplanted Bernie Sanders as the front-runner.
Democrats should be wary of backing Joe Biden and his ‘senior moments’
But then a protester leaped on stage and Biden’s wife, Jill, 68, instinctively stepped between him and the lunging vegan, grabbed the woman by the wrists and shoved her away.
"""
text = Text(blob)
text.entities

[I-PER(['Joe', 'Biden']),
 I-ORG(['Democratic']),
 I-PER(['Bernie', 'Sanders']),
 I-PER(['Joe', 'Biden']),
 I-PER(['Jill'])]

In [153]:
# Checking the entities sentence by sentence
for sent in text.sentences:
    print("-> ", sent, "\n")
    for entity in sent.entities:
        print(entity.tag, entity)
    print()

->  Joe Biden is leading the race for the Democratic presidential nomination in one night and supplanted Bernie Sanders as the front-runner. 

I-PER ['Joe', 'Biden']
I-ORG ['Democratic']
I-PER ['Bernie', 'Sanders']

->  Democrats should be wary of backing Joe Biden and his ‘senior moments’ 

I-PER ['Joe', 'Biden']

->  But then a protester leaped on stage and Biden’s wife, Jill, 68, instinctively stepped between him and the lunging vegan, grabbed the woman by the wrists and shoved her away. 

I-PER ['Jill']



In [161]:
# locate the position of the entity within the sentence.
sent = text.sentences[0]
print(sent)
bernie = sent.entities[2]
print(bernie)
sent.words[bernie.start: bernie.end]
print("Start Word Index: ", bernie.start)
print("End Word Index: ", bernie.end)

Joe Biden is leading the race for the Democratic presidential nomination in one night and supplanted Bernie Sanders as the front-runner.
['Bernie', 'Sanders']
Start Word Index:  16
End Word Index:  18


<a id='morphologicalAnalysis'></a>
## Morphological Analysis

Polyglot offers trained Morfessor models to generate morphemes from words. The goal of the Morpho project is to develop unsupervised, data-driven methods that discover the regularities behind word formation in natural languages.

Morphemes are the primitive units of syntax: the smallest individually meaningful elements in the utterances of a language. They are important for language generation and detection. They can also be used to detect word tokens in improperly tokenised text (as shown in the example below).
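
To give an intuition for how known sub-word units can recover token boundaries, here is a toy dictionary-based segmenter. It is explicitly not Morfessor, which learns its units statistically from data; the unit set and the function `segment` are made up for this sketch.

```python
def segment(text, units):
    """Split `text` into the fewest known units; return None if impossible."""
    best = {0: []}  # best[i] holds the shortest segmentation found for text[:i]
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if start in best and piece in units:
                candidate = best[start] + [piece]
                if end not in best or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best.get(len(text))


# Hypothetical inventory of units, hand-picked for the example.
units = {"spring", "break", "is", "here", "process", "ing", "."}

print(segment("springbreakishere.", units))
print(segment("processing", units))
print(segment("xyz", units))
```

A learned model replaces the hand-picked unit set with units and costs estimated from a corpus, which is what lets it generalise to unseen words.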

In [129]:
# languages supported for morpheme generation in polyglot
from polyglot.downloader import downloader
print(downloader.supported_languages_table("morph2"))

  1. Kapampangan                2. Italian                    3. Upper Sorbian            
  4. Sakha                      5. Hindi                      6. French                   
  7. Spanish; Castilian         8. Vietnamese                 9. Arabic                   
 10. Macedonian                11. Pashto, Pushto            12. Bosnian-Croatian-Serbian 
 13. Egyptian Arabic           14. Norwegian Nynorsk         15. Sundanese                
 16. Sicilian                  17. Azerbaijani               18. Bulgarian                
 19. Yoruba                    20. Tajik                     21. Georgian                 
 22. Tatar                     23. Galician                  24. Malagasy                 
 25. Uighur, Uyghur            26. Amharic                   27. Venetian                 
 28. Yiddish                   29. Norwegian                 30. Alemannic                
 31. Estonian                  32. West Flemish              33. Divehi; Dhivehi; Mald... 

In [163]:
# generating morphemes for some common words
from polyglot.text import Text, Word
words = ["Information", "retrieval", "and", "storage", "natural", "language", "processing"]
for w in words:
    w = Word(w, language="en")
    print("{:<20}{}".format(w, w.morphemes))

Information         ['In', 'form', 'ation']
retrieval           ['retriev', 'al']
and                 ['and']
storage             ['stor', 'age']
natural             ['natural']
language            ['language']
processing          ['process', 'ing']


In [165]:
# using morphemes to properly tokenise a string
blob = "Springbreakishere."
text = Text(blob)
text.language = "en"
text.morphemes

WordList(['Spring', 'break', 'is', 'here', '.'])

<a id='transliteration'></a>
## Transliteration

Transliteration is the process of converting a word from the script of one language into that of another. It is a helpful tool for writing words in a different script while preserving their pronunciation.

Polyglot offers transliteration support for almost 70 languages.

In [134]:
# language support for Transliteration
from polyglot.downloader import downloader
print(downloader.supported_languages_table("transliteration2"))

  1. Italian                    2. Hindi                      3. French                   
  4. Spanish; Castilian         5. Vietnamese                 6. Arabic                   
  7. Macedonian                 8. Bosnian-Croatian-Serbian   9. Norwegian Nynorsk        
 10. Azerbaijani               11. Bulgarian                 12. Georgian                 
 13. Galician                  14. Amharic                   15. Yiddish                  
 16. Norwegian                 17. Estonian                  18. Japanese                 
 19. Haitian; Haitian Creole   20. Belarusian                21. Greek, Modern            
 22. Welsh                     23. Albanian                  24. Marathi (Marāṭhī)        
 25. Armenian                  26. Slovene                   27. Korean                   
 28. Irish                     29. Bengali                   30. Serbian                  
 31. Finnish                   32. Catalan; Valencian        33. Croatian                 

In [168]:
# use polyglot to transliterate words from English to Gujarati
from polyglot.transliteration import Transliterator
from polyglot.text import Text

blob = """This is a cool feature to have at your disposal"""
text = Text(blob)
for x in text.transliterate("gu"):
    print(x)

થીસ
ીસ
ા
કોોલ
ફેતુરે
ટો
હાવે
ત
યોુર
ડીસપોસાલ


<a id='sentimentAnalysis'></a>
## Sentiment Analysis

Polyglot has polarity lexicons for 136 languages. A word's polarity takes one of three values: +1 for positive words, -1 for negative words, and 0 for neutral words.
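
A lexicon of this kind can be sketched as a plain dictionary, with a sentence score computed by averaging the non-neutral words. The lexicon entries and the averaging rule are assumptions chosen for illustration, not a description of polyglot's internals.

```python
# Toy polarity lexicon (hypothetical entries): +1 positive, -1 negative.
lexicon = {"superb": 1, "amazing": 1, "terribly": -1, "slow": -1}


def word_polarity(word):
    # Words missing from the lexicon are treated as neutral (0).
    return lexicon.get(word.lower(), 0)


def sentence_polarity(words):
    # Average over the words that carry sentiment; 0.0 if there are none.
    scored = [p for p in map(word_polarity, words) if p != 0]
    return sum(scored) / len(scored) if scored else 0.0


mixed = ["Language", "coverage", "is", "amazing", "and", "terribly", "slow", "."]
positive = ["A", "superb", "feature", "."]

print([word_polarity(w) for w in mixed])
print(sentence_polarity(mixed))
print(sentence_polarity(positive))
```

Because each word is scored in isolation, an intensifier such as "terribly" counts as negative even when it modifies a positive word, which is a known limitation of lexicon-based approaches.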

In [137]:
# language coverage
from polyglot.downloader import downloader
print(downloader.supported_languages_table("sentiment2"))

  1. Kapampangan                2. Italian                    3. Upper Sorbian            
  4. Sakha                      5. Hindi                      6. French                   
  7. Spanish; Castilian         8. Vietnamese                 9. Arabic                   
 10. Macedonian                11. Pashto, Pushto            12. Bosnian-Croatian-Serbian 
 13. Egyptian Arabic           14. Norwegian Nynorsk         15. Sundanese                
 16. Sicilian                  17. Azerbaijani               18. Bulgarian                
 19. Yoruba                    20. Tajik                     21. Georgian                 
 22. Tatar                     23. Galician                  24. Malagasy                 
 25. Uighur, Uyghur            26. Amharic                   27. Venetian                 
 28. Yiddish                   29. Norwegian                 30. Alemannic                
 31. Estonian                  32. West Flemish              33. Divehi; Dhivehi; Mald... 

In [178]:
# getting polarity for the words in a sentence
text = Text("Language coverage for this superb feature is amazing and terribly slow.")
print("{:<16}{}".format("Word", "Polarity")+"\n"+"-"*30)
for w in text.words:
    print("{:<16}{:>2}".format(w, w.polarity))

Word            Polarity
------------------------------
Language         0
coverage         0
for              0
this             0
superb           1
feature          0
is               0
amazing          1
and              0
terribly        -1
slow            -1
.                0


In [169]:
# getting polarity for the entities in a sentence
blob = ("Barack Obama gave a fantastic speech last night. "
        "Reports indicate he will move next to New Hampshire.")
text = Text(blob)
first_sentence = text.sentences[0]
print(first_sentence)
first_entity = first_sentence.entities[0]
print(first_entity)
print(first_entity.positive_sentiment)
print(first_entity.negative_sentiment)

Barack Obama gave a fantastic speech last night.
['Barack', 'Obama']
0.9444444444444444
0


## Conclusion

Many libraries offer a more extensive feature list than polyglot. However, polyglot offers its features in a wide range of languages and is easy to configure and use compared to larger libraries such as NLTK and spaCy. Thanks to NumPy, it is also fast. Using polyglot is similar to using spaCy: it is efficient and straightforward, and it is a good choice for projects involving a language spaCy does not support. The library also stands out because its functionality can be run from the command line through a dedicated command and chained with pipeline mechanisms.

## References

- [https://polyglot.readthedocs.io/en/latest/Sentiment.html](https://polyglot.readthedocs.io/en/latest/Sentiment.html)
- [https://github.com/aboSamoor/polyglot](https://github.com/aboSamoor/polyglot)
- [https://medium.com/activewizards-machine-learning-company/comparison-of-top-6-python-nlp-libraries-c4ce160237eb](https://medium.com/activewizards-machine-learning-company/comparison-of-top-6-python-nlp-libraries-c4ce160237eb)
- [https://www.pythonpodcast.com/polyglot-with-rami-al-rfou-episode-190/](https://www.pythonpodcast.com/polyglot-with-rami-al-rfou-episode-190/)