# NLP with Polyglot

### Installing on Linux
* pip install polyglot
* pip install pyicu
* pip install pycld2
* pip install morfessor
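
After installing, a quick check can confirm that the packages are importable. This is a minimal sketch using only the standard library; the helper name `missing_modules` is made up for this example, and it assumes PyICU is imported under the module name `icu`.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names for the packages installed above
# (assumption: the pyicu package is imported as `icu`).
required = ["polyglot", "icu", "pycld2", "morfessor"]
print(missing_modules(required) or "all modules found")
```

An empty result means every module was located; any name that is printed still needs to be installed.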

#### Downloading polyglot models
* polyglot download embeddings2.en
* polyglot download ner2.en
* polyglot download sentiment2.en
* polyglot download pos2.en
* polyglot download morph2.en
* polyglot download transliteration2.ar

### Tokenization
* Splitting text into words and sentences

In [2]:
# Load packages
import polyglot
from polyglot.text import Text, Word

In [3]:
docx = Text(u"He likes reading and painting")

In [4]:
# Word Tokens
docx.words

WordList(['He', 'likes', 'reading', 'and', 'painting'])

In [5]:
docx2 = Text(u"He exclaimed, 'what're you doing? Reading?'.")

In [6]:
docx2.words

WordList(['He', 'exclaimed', ',', "'", "what're", 'you', 'doing', '?', 'Reading', '?', "'", '.'])
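
To make the splitting rule visible, the token list above can be approximated with a regular expression. This is only a sketch and not polyglot's tokenizer, which (as far as I know) relies on ICU and covers far more cases; `simple_tokens` and `TOKEN_RE` are names invented for this example.

```python
import re

# Either a run of word characters, optionally joined by internal apostrophes
# (so "what're" stays whole), or a single non-space punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def simple_tokens(text):
    """Approximate word tokenization with a regular expression."""
    return TOKEN_RE.findall(text)


print(simple_tokens("He exclaimed, 'what're you doing? Reading?'."))
```

On this sentence the sketch yields the same twelve tokens as `docx2.words`, with quotes and punctuation split off and the contraction kept together.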

In [7]:
# Sentence tokens
docx3 = Text(u"He likes reading and painting. He exclaimed, 'what're you doing? Reading?'.")

In [8]:
docx3.sentences

[Sentence("He likes reading and painting."),
 Sentence("He exclaimed, 'what're you doing?"),
 Sentence("Reading?'.")]
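
The sentence boundaries above fall after `.` and `?` when whitespace follows. A rough sketch of that rule, assuming a plain regular expression is good enough for the illustration (it is not how polyglot segments text, and `simple_sentences` is a hypothetical helper):

```python
import re


def simple_sentences(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


text = "He likes reading and painting. He exclaimed, 'what're you doing? Reading?'."
for sentence in simple_sentences(text):
    print(sentence)
```

Like the output above, this splits the quoted question in two, because the rule looks only at punctuation and ignores the surrounding quotation marks.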

### Parts of Speech Tagging
* polyglot download embeddings2.en pos2.en
* pos_tags

In [9]:
docx

Text("He likes reading and painting")

In [10]:
docx.pos_tags

[('He', 'PRON'),
 ('likes', 'VERB'),
 ('reading', 'VERB'),
 ('and', 'CONJ'),
 ('painting', 'NOUN')]
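
Because `pos_tags` is a list of `(word, tag)` pairs, it can be filtered and counted with ordinary Python. The sketch below copies the output above into a literal so that it runs without polyglot; the variable names are illustrative.

```python
from collections import Counter

# Output of docx.pos_tags, copied from the cell above.
pos_tags = [
    ("He", "PRON"),
    ("likes", "VERB"),
    ("reading", "VERB"),
    ("and", "CONJ"),
    ("painting", "NOUN"),
]

# Keep only the words carrying a given tag.
verbs = [word for word, tag in pos_tags if tag == "VERB"]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in pos_tags)

print(verbs)
print(tag_counts.most_common())
```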

### Language Detection
* polyglot.detect
* language.name
* language.code

In [11]:
docx

Text("He likes reading and painting")

In [12]:
docx.language.name

'English'

In [13]:
docx.language.code

'en'

In [14]:
from polyglot.detect import Detector

In [15]:
en_text = "He is a student "
fr_text = "Il est un étudiant"
ru_text = "Он студент"

In [16]:
detect_en = Detector(en_text)
detect_fr = Detector(fr_text)
detect_ru = Detector(ru_text)

Detector is not able to detect the language reliably.
Detector is not able to detect the language reliably.


In [17]:
print(detect_en.language)

name: English     code: en       confidence:  94.0 read bytes:   704


In [18]:
print(detect_fr.language)

name: French      code: fr       confidence:  95.0 read bytes:   870


In [19]:
print(detect_ru.language)

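name: Serbian     code: sr       confidence:  95.0 read bytes:   614

Note: the Russian sample is misidentified as Serbian. The warnings printed when the detectors were created signal that inputs this short cannot be detected reliably.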
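
One plausible factor behind the Serbian result for the Russian sample is that both languages can be written in Cyrillic, so a two-word input gives the detector little to separate them. The sketch below is independent of polyglot and only inspects the script of each letter; `script_counts` is a hypothetical helper built on the standard `unicodedata` module.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters by the first word of their Unicode character name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER EN" -> "CYRILLIC"
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts


print(script_counts("Он студент"))
print(script_counts("Il est un étudiant"))
```

A check like this can flag results that contradict the script, but it cannot tell apart two languages that share one.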


### Sentiment Analysis
* polarity

In [20]:
docx4 = Text(u"He hates reading and playing")

In [21]:
docx

Text("He likes reading and painting")

In [22]:
docx.polarity

1.0

In [23]:
docx4.polarity

-1.0
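
The scores of exactly 1.0 and -1.0 are consistent with averaging word-level polarities over the non-neutral words only. That averaging rule is an assumption here, not something this notebook confirms, and the tiny `LEXICON` below is invented for the illustration (polyglot uses its downloaded sentiment model instead).

```python
# Hypothetical word-level polarities: +1 positive, -1 negative, absent = neutral.
LEXICON = {"likes": 1, "hates": -1}


def mean_polarity(words, lexicon=LEXICON):
    """Average the polarity of non-neutral words; 0.0 if none are found."""
    scores = [lexicon[w] for w in words if lexicon.get(w, 0) != 0]
    return sum(scores) / len(scores) if scores else 0.0


print(mean_polarity("He likes reading and painting".split()))
print(mean_polarity("He hates reading and playing".split()))
```

Under this rule a single polar word decides the score for the whole sentence, which is why neutral words such as "reading" do not pull the value toward zero.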

### Named Entities
* entities

In [24]:
docx5 = Text(u"John Jones was a FBI detector")

In [25]:
docx5.entities

[I-PER(['John', 'Jones']), I-ORG(['FBI'])]
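
Each entity above is a run of consecutive tokens that share one tag. A minimal sketch of that grouping step, assuming per-token tags with `O` for tokens outside any entity; the tag list is written by hand for the example rather than produced by polyglot, and `group_entities` is a hypothetical helper.

```python
from itertools import groupby

# Hypothetical per-token tags; "O" marks tokens outside any entity.
tagged = [
    ("John", "I-PER"), ("Jones", "I-PER"), ("was", "O"),
    ("a", "O"), ("FBI", "I-ORG"), ("detector", "O"),
]


def group_entities(tagged_tokens):
    """Merge runs of consecutive tokens that share the same entity tag."""
    chunks = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != "O":
            chunks.append((tag, [word for word, _ in run]))
    return chunks


print(group_entities(tagged))
```

One limitation of grouping on the tag alone is that two adjacent entities of the same type would be merged into a single chunk.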

### Morphology
* A morpheme is the smallest grammatical unit in a language.
* A morpheme may or may not stand alone, whereas a word, by definition, is freestanding.
* morphemes

In [26]:
docx6 = Text(u"preprocessing")

In [27]:
docx6.morphemes

Detector is not able to detect the language reliably.


WordList(['pre', 'process', 'ing'])
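
To illustrate what a morpheme split looks like, here is a deliberately naive affix stripper. It swaps in a hand-written rule for the statistical segmentation that Morfessor performs, so it is a toy rather than a model of what polyglot does; the affix lists and the name `split_affixes` are assumptions made for this example.

```python
# Hypothetical affix lists, chosen only for this illustration.
PREFIXES = ("pre", "un", "re")
SUFFIXES = ("ing", "ed", "s")


def split_affixes(word):
    """Peel off at most one known prefix and one known suffix."""
    before, after = [], []
    for prefix in PREFIXES:
        # Require a stem of at least three characters to remain.
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            before.append(prefix)
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            after.append(suffix)
            word = word[:-len(suffix)]
            break
    return before + [word] + after


print(split_affixes("preprocessing"))
```

A fixed list like this breaks quickly on words such as "president", which is one reason a learned segmenter is used in practice.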

### Transliteration

In [28]:
# Load the transliterator (English to French)
from polyglot.transliteration import Transliterator
translit = Transliterator(source_lang='en', target_lang='fr')

In [29]:
translit.transliterate(u"working")

'working'