# Text Processing For English

## Prerequisites: Install [spaCy](https://spacy.io/)
**spaCy** is a relatively new framework in the Python Natural Language Processing ecosystem. There are some good reasons for its popularity:


*   **It's really FAST**
  * Written in Cython, it was specifically designed to be as fast as possible

*   **It's really ACCURATE**
  * spaCy's dependency parser is one of the best-performing parsers available: [It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool](https://aclweb.org/anthology/P/P15/P15-1038.pdf)

* **Batteries included**
  * *Index preserving tokenization*
  * Models for *Part Of Speech tagging, Named Entity Recognition* and *Dependency Parsing*
  * Supports *8 languages* out of the box
  * Easy and *beautiful visualizations*
  * Pretrained *word vectors*
* **Extensible**
  * It plays nicely with all the other already existing tools that are popular: Scikit-Learn, TensorFlow, gensim

* **DeepLearning Ready**
  * It also has its own deep learning framework that’s especially designed for NLP tasks: [Thinc: A refreshing functional take on deep learning, compatible with your favorite libraries](https://github.com/explosion/thinc)

In [None]:
!python3 -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


Note that installing spaCy does not automatically download the English model; we have to download it ourselves. This pipeline handles English text. Later in this tutorial we will process Hindi text with a different library.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

## Tokenization and Tagging

In [None]:
# Token Text

doc = nlp('Hello     World!')
print("doc contains the string is:", doc)

# Iterate over tokens in a Doc
for token in doc:
    print('"' + token.text + '"')

doc contains the string is: Hello     World!
"Hello"
"    "
"World"
"!"


Notice the index-preserving tokenization in action. Rather than keeping only the words, spaCy keeps the extra whitespace too. This is helpful when you need to replace words in the original text or add annotations. With a tokenizer that returns only a list of strings, such as NLTK's basic `word_tokenize`, it is hard to know exactly where a token sits in the original raw text. spaCy preserves this “link” between a token and its place in the raw text: the `idx` attribute of each token is its character offset in the original string.
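The idea of keeping offsets alongside tokens can be sketched without spaCy. The snippet below is only an illustration built on the standard `re` module; the helper name `tokenize_with_offsets` and its regular expression are assumptions made for this example and are not how spaCy tokenizes internally.

```python
import re

def tokenize_with_offsets(text):
    """Return (token, start_offset) pairs.

    A regex stand-in for illustration only, not spaCy's tokenizer:
    words and single punctuation marks become tokens, whitespace is skipped.
    """
    return [(m.group(), m.start()) for m in re.finditer(r"\w+|[^\w\s]", text)]

text = "Hello     World!"
tokens = tokenize_with_offsets(text)

for tok, start in tokens:
    # Slicing the raw text at the stored offset recovers the token exactly.
    assert text[start:start + len(tok)] == tok
    print(repr(tok), start)

# With offsets we can edit the raw text in place, keeping the original spacing.
tok, start = tokens[1]
edited = text[:start] + "There" + text[start + len(tok):]
print(edited)
```

Because each token remembers where it came from, the replacement leaves the run of spaces between the two words untouched.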

The `Token` class exposes a lot of word-level attributes. Here are a few examples:

In [None]:
doc = nlp("Next week I'll   be in India.")
# The header lists exactly the eight attributes printed for each token
print("text","\tidx","\tlemma","\tpunct?","\tspace?","\tshape","\tpos","\ttag")
print("--------------------------------------------------------------------------")
for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_
    ))

text 	idx 	lemma 	punct? 	space? 	shape 	pos 	tag
--------------------------------------------------------------------------
Next	0	next	False	False	Xxxx	ADJ	JJ
week	5	week	False	False	xxxx	NOUN	NN
I	10	I	False	False	X	PRON	PRP
'll	11	will	False	False	'xx	AUX	MD
  	15	  	False	True	  	SPACE	_SP
be	17	be	False	False	xx	AUX	VB
in	20	in	False	False	xx	ADP	IN
India	23	India	False	False	Xxxxx	PROPN	NNP
.	28	.	True	False	.	PUNCT	.


In [None]:
sentence = u"It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"

#TODO: Print the tokens and all word level attributes of the sentences.

### Sentence detection
Splitting text into sentences is one of the most common NLP tasks. spaCy exposes the detected sentences through `doc.sents`:

In [None]:
doc = nlp("These are apples. These are oranges.")

for sent in doc.sents:
    print(sent)

These are apples.
These are oranges.


In [None]:
# Sentence Detection

doc = nlp(u"Natural language (NL) refers to the language spoken/written by humans. NL is the primary mode of communication for humans. With the growth of the world wide web, data in the form of text has grown exponentially. It calls for the development of algorithms and techniques for processing natural language for the automation and development of intelligent machines.")

#TODO: Print all the valid sentences

## Part Of Speech Tagging

In [None]:
doc = nlp("Next week I'll be in Madrid.")
print([(token.text, token.tag_) for token in doc])

In [None]:
# POS Tagging

sentence = u"It’s official: Apple is the first U.S. public company to reach a $1 trillion market value."

#TODO: Print the POS of each token.

## Named Entity Recognition
Doing NER with spaCy is super easy and the pretrained model performs pretty well:

In [None]:
doc = nlp("Next week I'll be in Madrid.")
for ent in doc.ents:
    print(ent.text, ent.label_)

Next week DATE
Madrid GPE


### spaCy Entity Types
The spaCy NER model recognizes a wide variety of entity types. You can view the full list here: [Entity Types](https://spacy.io/usage/linguistic-features#entity-types)

In [None]:
doc = nlp(u"I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ")
for ent in doc.ents:
    print(ent.text, ent.label_)

2 CARDINAL
9 a.m. TIME
30% PERCENT
just 2 days DATE
WSJ ORG


`spacy.explain` returns a short description of a label, which we can print next to each entity's text and label:

In [None]:
# Iterate over the doc.ents and print the entity text and label_ attribute.

text = "I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ"

# Process the text
doc = nlp(text)

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(f"{ent.text:<15}{ent.label_:<12}{spacy.explain(ent.label_):<12}")

2              CARDINAL    Numerals that do not fall under another type
9 a.m.         TIME        Times smaller than a day
30%            PERCENT     Percentage, including "%"
just 2 days    DATE        Absolute or relative dates or periods
WSJ            ORG         Companies, agencies, institutions, etc.


You can also view the Inside–outside–beginning (IOB) style tagging of the sentence like this:

In [None]:
from nltk.chunk import conlltags2tree


doc = nlp("Next week I'll be in Madrid.")
iob_tagged = [
    (
        token.text,
        token.tag_,
        "{0}-{1}".format(token.ent_iob_, token.ent_type_) if token.ent_iob_ != 'O' else token.ent_iob_
    ) for token in doc
]

print(iob_tagged)

# In case you like the nltk.Tree format
print(conlltags2tree(iob_tagged))

[('Next', 'JJ', 'B-DATE'), ('week', 'NN', 'I-DATE'), ('I', 'PRP', 'O'), ("'ll", 'MD', 'O'), ('be', 'VB', 'O'), ('in', 'IN', 'O'), ('Madrid', 'NNP', 'B-GPE'), ('.', '.', 'O')]
(S
  (DATE Next/JJ week/NN)
  I/PRP
  'll/MD
  be/VB
  in/IN
  (GPE Madrid/NNP)
  ./.)
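To make the IOB scheme itself concrete, here is a small library-free sketch. The helper `spans_to_iob` is a hypothetical name introduced for this example; it takes a token list and entity spans given as `(start_token, end_token, label)` with an exclusive end, and assigns `B-` to the first token of each entity, `I-` to the tokens inside it, and `O` to everything else.

```python
def spans_to_iob(tokens, spans):
    """Convert token-index entity spans into (token, IOB-tag) pairs.

    `spans` holds (start, end, label) tuples with `end` exclusive.
    Hypothetical helper for illustration; spaCy exposes the same
    information through token.ent_iob_ and token.ent_type_.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return list(zip(tokens, tags))

tokens = ["Next", "week", "I", "'ll", "be", "in", "Madrid", "."]
spans = [(0, 2, "DATE"), (6, 7, "GPE")]

iob = spans_to_iob(tokens, spans)
print(iob)
```

The tags produced for this sentence are the same ones shown in the spaCy output above.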


In [None]:
# NER Named Entity Recognition

sentence = u"A UN review of national plans to cut carbon says they are well short of the levels needed to keep the rise in global temperatures under 2C. Many scientists say that technology to remove carbon from the air will now be needed to meet the Paris targets."

#TODO: Print the entities along with their labels that are found in the sentence.

In [None]:
# TODO: Print the entity text, its label attribute, and an explanation of the label for each entity.

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value. I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ"


In [None]:
# Print the entity text, label attribute in the IOB style tagging.

text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value. I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ"


Let’s use `displaCy` to view a beautiful visualization of the Named Entity annotated sentence:

In [None]:
# displaCy

from spacy import displacy

doc = nlp(u'I just bought 2 shares at 9 a.m. because the stock went up 30% in just 2 days according to the WSJ')
displacy.render(doc, style='ent', jupyter=True)

In [None]:
# displaCy

sentence = u"A UN review of national plans to cut carbon says they are well short of the levels needed to keep the rise in global temperatures under 2C. Many scientists say that technology to remove carbon from the air will now be needed to meet the Paris targets."

#TODO: Visualize the Named Entity annotated sentence using displaCy.

### Chunking
spaCy automatically detects noun-phrases as well:

In [None]:
# Chunking
doc = nlp("Wall Street Journal just published an interesting piece on crypto currencies")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.label_, chunk.root.text)

Wall Street Journal NP Journal
an interesting piece NP piece
crypto currencies NP currencies


Notice how the chunker also computes the root of the phrase, the main word of the phrase.

## Dependency Parsing

In [None]:
# Dependency Parsing

doc = nlp(u'Wall Street Journal just published an interesting piece on crypto currencies')

for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

Wall/NNP <--compound-- Street/NNP
Street/NNP <--compound-- Journal/NNP
Journal/NNP <--nsubj-- published/VBD
just/RB <--advmod-- published/VBD
published/VBD <--ROOT-- published/VBD
an/DT <--det-- piece/NN
interesting/JJ <--amod-- piece/NN
piece/NN <--dobj-- published/VBD
on/IN <--prep-- piece/NN
crypto/JJ <--amod-- currencies/NNS
currencies/NNS <--pobj-- on/IN


If this output doesn’t help you visualize the dependency tree, displaCy comes in handy:

In [None]:
# Visualizing Dependency Parsing

from spacy import displacy

doc = nlp(u'Wall Street Journal just published an interesting piece on crypto currencies')

displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

Wall/NNP <--compound-- Street/NNP
Street/NNP <--compound-- Journal/NNP
Journal/NNP <--nsubj-- published/VBD
just/RB <--advmod-- published/VBD
published/VBD <--ROOT-- published/VBD
an/DT <--det-- piece/NN
interesting/JJ <--amod-- piece/NN
piece/NN <--dobj-- published/VBD
on/IN <--prep-- piece/NN
crypto/JJ <--amod-- currencies/NNS
currencies/NNS <--pobj-- on/IN


In [None]:
sentence = u'It took me more than two hours to translate a few pages of English.'

#TODO:  Detect all Noun-phrases in the sentence.

In [None]:
sentence = u'It took me more than two hours to translate a few pages of English.'

#TODO:  Print Dependency Parsing.

In [None]:
sentence = u'It took me more than two hours to translate a few pages of English.'

#TODO:  Visualizing Dependency Parsing.

# You can also try:
  # 1. Print the head words and their dependencies
  # 2. Print the head/root of the sentence
  # 3. Print the ancestors of "translate"
  # 4. Print the children of "pages"

# Text Processing For Hindi

spaCy does not currently provide a trained pipeline for Hindi, so we are going to use the [Indic NLP Library](https://anoopkunchukuttan.github.io/indic_nlp_library/), a collection of resources and tools for Indian language Natural Language Processing.

The goal of the Indic NLP Library is to build Python based libraries for common text processing and Natural Language Processing in Indian languages. Indian languages share a lot of similarity in terms of script, phonology, language syntax, etc. and this library is an attempt to provide a general solution to very commonly required toolsets for Indian language text. But in this tutorial our main focus will be on Hindi only.

The library provides the following functionalities:

* Text Normalization
* Script Information
* Tokenization
* Word Segmentation
* Script Conversion
* Romanization
* Indicization
* Transliteration

## Pre-requisites

- Python 3.5+
- [Morfessor 2.0 Python Library](http://www.cis.hut.fi/projects/morpho/morfessor2.shtml)

## Language Support
* Indo-Aryan
  * Assamese -- asm
  * Bengali -- ben
  * Gujarati -- guj
  * Hindi -- hin
  * Urdu  -- urd
  * Marathi -- mar
  * Nepali -- nep
  * Odia -- ori
  * Punjabi -- pan
  * Sindhi -- sin
  * Sanskrit -- san
  * Konkani -- kok

* Dravidian
  * Kannada -- kan
  * Malayalam -- mal
  * Telugu -- tel
  * Tamil -- tam

* Others
  * English -- eng


In [None]:
!git clone "https://github.com/anoopkunchukuttan/indic_nlp_library"

Cloning into 'indic_nlp_library'...
remote: Enumerating objects: 1404, done.
remote: Counting objects: 100% (185/185), done.
remote: Compressing objects: 100% (58/58), done.
remote: Total 1404 (delta 139), reused 152 (delta 124), pack-reused 1219 (from 1)
Receiving objects: 100% (1404/1404), 9.57 MiB | 11.55 MiB/s, done.
Resolving deltas: 100% (749/749), done.


In [None]:
!git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git

Cloning into 'indic_nlp_resources'...
remote: Enumerating objects: 139, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 139 (delta 2), reused 2 (delta 0), pack-reused 126 (from 1)
Receiving objects: 100% (139/139), 149.77 MiB | 32.75 MiB/s, done.
Resolving deltas: 100% (53/53), done.
Updating files: 100% (28/28), done.


In [None]:
!pip install Morfessor

Collecting Morfessor
  Downloading Morfessor-2.0.6-py3-none-any.whl.metadata (628 bytes)
Downloading Morfessor-2.0.6-py3-none-any.whl (35 kB)
Installing collected packages: Morfessor
Successfully installed Morfessor-2.0.6


**----- Set these variables -----**

In [None]:
# The path to the local git repo for Indic NLP library
INDIC_NLP_LIB_HOME=r"/content/indic_nlp_library"

# The path to the local git repo for Indic NLP Resources
INDIC_NLP_RESOURCES="/content/indic_nlp_resources"

**Add Library to Python path**

In [None]:
import sys
sys.path.append(INDIC_NLP_LIB_HOME)

**Set environment variable**

In [None]:
from indicnlp import common
common.set_resources_path(INDIC_NLP_RESOURCES)

**Initialize the Indic NLP library**

In [None]:
from indicnlp import loader
loader.load()

## Text Normalization

Text written in Indic scripts displays a lot of quirky behaviour on account of varying input methods, multiple representations for the same character, etc.
There is a need to canonicalize the representation of text so that NLP applications can handle the data in a consistent manner. The canonicalization primarily handles the following issues:

- Non-spacing characters like ZWJ/ZWNJ
- Multiple representations of Nukta-based characters
- Multiple representations of two-part dependent vowel signs
- Typing inconsistencies: e.g. use of pipe (|) for poorna virama

When the available data is scarce, such normalization helps use it more efficiently.
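The "multiple representations" problem can be seen with the standard library alone. The sketch below uses Python's `unicodedata` module, not the Indic NLP Library, to show that the precomposed letter U+0958 and the sequence U+0915 U+093C (KA followed by NUKTA) are different strings that become identical after Unicode canonical decomposition (NFD). It only illustrates why canonicalization is needed; the library's normalizers apply their own, script-specific rules.

```python
import unicodedata

precomposed = "\u0958"        # DEVANAGARI LETTER QA (single code point)
decomposed = "\u0915\u093c"   # KA followed by the NUKTA sign

# The two strings look the same on screen but compare as different.
print(precomposed == decomposed)

# After canonical decomposition both have the same code point sequence.
nfd_a = unicodedata.normalize("NFD", precomposed)
nfd_b = unicodedata.normalize("NFD", decomposed)
print(nfd_a == nfd_b)
print([hex(ord(c)) for c in nfd_a])
```

Without a step like this, a dictionary lookup or frequency count would treat the two spellings of the same letter as unrelated entries.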

In [None]:
# from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
from indicnlp.normalize.indic_normalize import BaseNormalizer

input_text="\u0958 \u0915\u093c"
remove_nuktas=False
# factory=IndicNormalizerFactory()
normalizer = BaseNormalizer("hi", remove_nuktas=remove_nuktas)
# normalizer=factory.get_normalizer("hi",remove_nuktas)
output_text=normalizer.normalize(input_text)

print(input_text)
print()

print('Before normalization')
print(' '.join([ hex(ord(c)) for c in input_text ] ))
print('Length: {}'.format(len(input_text)))
print()
print('After normalization')
print(' '.join([ hex(ord(c)) for c in output_text ] ))
print('Length: {}'.format(len(output_text)))


क़ क़

Before normalization
0x958 0x20 0x915 0x93c
Length: 4

After normalization
0x958 0x20 0x915 0x93c
Length: 4


## Sentence Splitter

A smart sentence splitter which uses a two-pass rule-based system to split the text into sentences. It knows of common prefixes in Indian languages.

In [None]:
from indicnlp.tokenize import sentence_tokenize

indic_string="""तो क्या विश्व कप 2019 में मैच का बॉस टॉस है? यानी मैच में हार-जीत में \
टॉस की भूमिका अहम है? आप ऐसा सोच सकते हैं। विश्वकप के अपने-अपने पहले मैच में बुरी तरह हारने वाली एशिया की दो टीमों \
पाकिस्तान और श्रीलंका के कप्तान ने हालांकि अपने हार के पीछे टॉस की दलील तो नहीं दी, लेकिन यह जरूर कहा था कि वह एक अहम टॉस हार गए थे।"""
sentences=sentence_tokenize.sentence_split(indic_string, lang='hi')
for t in sentences:
    print(t)

तो क्या विश्व कप 2019 में मैच का बॉस टॉस है?
यानी मैच में हार-जीत में टॉस की भूमिका अहम है?
आप ऐसा सोच सकते हैं।
विश्वकप के अपने-अपने पहले मैच में बुरी तरह हारने वाली एशिया की दो टीमों पाकिस्तान और श्रीलंका के कप्तान ने हालांकि अपने हार के पीछे टॉस की दलील तो नहीं दी, लेकिन यह जरूर कहा था कि वह एक अहम टॉस हार गए थे।


## Tokenization

A trivial tokenizer that simply splits on punctuation boundaries. The punctuation it recognizes includes the marks used in Indian language scripts (the purna virama and the deergha virama). It returns a list of tokens.
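As a rough picture of what "tokenizing on punctuation boundaries" means, here is a regex sketch using only the standard library. The pattern and the punctuation set are assumptions chosen for this example (a few ASCII marks plus the purna virama U+0964 and deergha virama U+0965); the library's `trivial_tokenize` may use a different set.

```python
import re

# Either a single punctuation mark, or a run of characters that are
# neither whitespace nor one of those punctuation marks.
TOKEN_PATTERN = re.compile(r"[,.?!;:\u0964\u0965]|[^\s,.?!;:\u0964\u0965]+")

def simple_indic_tokenize(text):
    """Split text on whitespace and punctuation boundaries (illustrative only)."""
    return TOKEN_PATTERN.findall(text)

indic_string = "सुनो, कुछ आवाज़ आ रही है\u0964 फोन?"
tokens = simple_indic_tokenize(indic_string)
for t in tokens:
    print(t)
```

The punctuation marks come out as separate tokens, so the sentence-final purna virama is not glued to the preceding word.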


In [None]:
from indicnlp.tokenize import indic_tokenize

indic_string='सुनो, कुछ आवाज़ आ रही है। फोन?'

print('Input String: {}'.format(indic_string))
print('Tokens: ')
for t in indic_tokenize.trivial_tokenize(indic_string):
    print(t)

Input String: सुनो, कुछ आवाज़ आ रही है। फोन?
Tokens: 
सुनो
,
कुछ
आवाज़
आ
रही
है
।
फोन
?


## De-tokenization

A de-tokenizer for Indian languages that can address punctuation in Indic languages. The de-tokenizer is useful when generating natural language output. It can be used as a post-processor.


In [None]:
from indicnlp.tokenize import indic_detokenize
indic_string='" सुनो , कुछ आवाज़ आ रही है . " , उसने कहा । '

print('Input String: {}'.format(indic_string))
print('Detokenized String: {}'.format(indic_detokenize.trivial_detokenize(indic_string,lang='hi')))


Input String: " सुनो , कुछ आवाज़ आ रही है . " , उसने कहा । 
Detokenized String: "सुनो, कुछ आवाज़ आ रही है.", उसने कहा। 


## Script Conversion

Convert from one Indic script to another. This is a simple conversion which exploits the fact that the Unicode code points of the various Indic scripts sit at corresponding offsets from the base code point of each script. The following scripts are supported:

_Devanagari (Hindi,Marathi,Sanskrit,Konkani,Sindhi,Nepali), Assamese, Bengali, Oriya, Gujarati, Gurumukhi (Punjabi), Sindhi, Tamil, Telugu, Kannada, Malayalam_
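The offset idea can be sketched in a few lines of plain Python. The example below shifts Devanagari code points (block starting at U+0900) into the Gujarati block (starting at U+0A80); the function name `shift_script` is hypothetical and the sketch deliberately ignores script-specific exceptions. Tamil, for instance, lacks some of the letters Devanagari has, so a real converter such as `UnicodeIndicTransliterator` has to do more than a plain shift there.

```python
DEVANAGARI_BASE = 0x0900
GUJARATI_BASE = 0x0A80
BLOCK_SIZE = 0x80  # each of these Unicode blocks spans 128 code points

def shift_script(text, src_base, tgt_base):
    """Map characters from one Indic block to the same offset in another.

    Illustrative sketch: characters outside the source block are copied
    unchanged, and no script-specific exceptions are handled.
    """
    out = []
    for ch in text:
        offset = ord(ch) - src_base
        if 0 <= offset < BLOCK_SIZE:
            out.append(chr(tgt_base + offset))
        else:
            out.append(ch)
    return "".join(out)

converted = shift_script("राजस्थान", DEVANAGARI_BASE, GUJARATI_BASE)
print(converted)
```

Every character in the input keeps its position within the block, which is why consonants, vowel signs and the virama all land on their Gujarati counterparts.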

In [None]:
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator
input_text='राजस्थान'
# input_text='രാജസ്ഥാന'
# input_text='රාජස්ථාන'
print(UnicodeIndicTransliterator.transliterate(input_text,"hi","ta"))

ராஜஸ்தாந


## Romanization

Convert script text to Roman text in the ITRANS notation

In [None]:
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

input_text='राजस्थान'
# input_text='ஆசிரியர்கள்'
lang='hi'

print(ItransTransliterator.to_itrans(input_text,lang))

raajasthaana


## Indicization (ITRANS to Indic Script)

Let's call the conversion of ITRANS-transliterated text to an Indic script **Indicization**!


In [None]:
from indicnlp.transliterate.unicode_transliterate import ItransTransliterator

input_text='kahaa jaanaa hai?'
lang='hi'
x=ItransTransliterator.from_itrans(input_text,lang)
print(x)
for y in x:
    print('{:x}'.format(ord(y)))

कहा जाना है?
915
939
93e
20
91c
93e
928
93e
20
939
948
3f


## Script Information

Indic scripts have been designed with phonetic principles in mind, and the design and organization of the scripts make it easy to obtain phonetic information about the characters.

### Get Phonetic Feature Vector

Each script character is associated with a phonetic feature vector, which encodes the phonetic properties of the character. This is a bit vector which can be obtained as shown below:

In [None]:
from indicnlp.script import  indic_scripts as isc

c='क'
lang='hi'

isc.get_phonetic_feature_vector(c,lang)

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

The fields in this bit vector are (from left to right):

In [None]:
sorted(isc.PV_PROP_RANGES.items(),key=lambda x:x[1][0])

[('basic_type', [0, 6]),
 ('vowel_length', [6, 8]),
 ('vowel_strength', [8, 11]),
 ('vowel_status', [11, 13]),
 ('consonant_type', [13, 18]),
 ('articulation_place', [18, 23]),
 ('aspiration', [23, 25]),
 ('voicing', [25, 27]),
 ('nasalization', [27, 29]),
 ('vowel_horizontal', [29, 32]),
 ('vowel_vertical', [32, 36]),
 ('vowel_roundness', [36, 38])]

You can check the phonetic information database files in Indic NLP resources to know the definition of each of the bits.

- _For Tamil Script_: [database](https://github.com/anoopkunchukuttan/indic_nlp_resources/blob/master/script/tamil_script_phonetic_data.csv)
- _For other Indic Scripts_: [database](https://github.com/anoopkunchukuttan/indic_nlp_resources/blob/master/script/all_script_phonetic_data.csv)
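To read the vector field by field, the ranges can be used as slice boundaries. The sketch below copies the vector for क and the field ranges from the outputs above into plain Python lists and assumes each range is half-open (start included, end excluded), which is consistent with the ranges tiling all 38 positions. It does not interpret what the individual bits mean; for that, consult the database files.

```python
# Values copied from the notebook outputs above (vector for the Hindi letter KA).
vector = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
          0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

ranges = [
    ("basic_type", [0, 6]), ("vowel_length", [6, 8]), ("vowel_strength", [8, 11]),
    ("vowel_status", [11, 13]), ("consonant_type", [13, 18]),
    ("articulation_place", [18, 23]), ("aspiration", [23, 25]),
    ("voicing", [25, 27]), ("nasalization", [27, 29]),
    ("vowel_horizontal", [29, 32]), ("vowel_vertical", [32, 36]),
    ("vowel_roundness", [36, 38]),
]

# Assumption: each [start, end] pair is a half-open slice of the vector.
fields = {name: vector[start:end] for name, (start, end) in ranges}

for name, bits in fields.items():
    print(f"{name:<20}{bits}")
```

With the real library objects, the same slicing could be applied to the array returned by `isc.get_phonetic_feature_vector` using `isc.PV_PROP_RANGES`.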

### Query Phonetic Properties

**Note:** _This interface below will be deprecated soon and a new interface will be available soon._

In [None]:
from indicnlp.langinfo import *

c='क'
lang='hi'

print('Is vowel?:  {}'.format(is_vowel(c,lang)))
print('Is consonant?:  {}'.format(is_consonant(c,lang)))
print('Is velar?:  {}'.format(is_velar(c,lang)))
print('Is palatal?:  {}'.format(is_palatal(c,lang)))
print('Is aspirated?:  {}'.format(is_aspirated(c,lang)))
print('Is unvoiced?:  {}'.format(is_unvoiced(c,lang)))
print('Is nasal?:  {}'.format(is_nasal(c,lang)))

Is vowel?:  False
Is consonant?:  True
Is velar?:  True
Is palatal?:  False
Is aspirated?:  False
Is unvoiced?:  True
Is nasal?:  False


### Get Phonetic Similarity

Using the phonetic feature vectors, we can define phonetic similarity between characters (and their underlying phonemes). The library implements several such measures, and because they are all defined over the feature vectors discussed earlier, users can implement additional similarity measures of their own.

The implemented similarity measures are:

- cosine
- dice
- jaccard
- dot_product
- sim1 (Kunchukuttan _et al._, 2016)
- softmax

**References**

Anoop Kunchukuttan, Pushpak Bhattacharyya, Mitesh Khapra. _Substring-based unsupervised transliteration with phonetic and contextual knowledge_. SIGNLL Conference on Computational Natural Language Learning **(CoNLL 2016)**. 2016.
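Since the measures operate on plain bit vectors, a custom one is easy to write. The sketch below implements ordinary cosine similarity in pure Python. The first vector is the one printed earlier for क; the second is a hypothetical vector, constructed for this example only, that differs from the first in the aspiration field. The library's own `psim.cosine` may differ in numerical details, so results need not match its output exactly.

```python
import math

def cosine_similarity(u, v):
    """Plain cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Vector for KA, copied from the earlier output.
vec_a = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
         0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Hypothetical second vector: identical except that the two aspiration
# bits (positions 23 and 24) are swapped.
vec_b = list(vec_a)
vec_b[23], vec_b[24] = vec_b[24], vec_b[23]

similarity = cosine_similarity(vec_a, vec_b)
print(similarity)
```

Both vectors have six set bits and share five of them, so the similarity is 5/6.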

In [None]:
from indicnlp.script import  indic_scripts as isc
from indicnlp.script import  phonetic_sim as psim

c1='क'
c2='ख'
c3='भ'
lang='hi'

print('Similarity between {} and {}'.format(c1,c2))
print(psim.cosine(
    isc.get_phonetic_feature_vector(c1,lang),
    isc.get_phonetic_feature_vector(c2,lang)
    ))

print()

print(u'Similarity between {} and {}'.format(c1,c3))
print(psim.cosine(
    isc.get_phonetic_feature_vector(c1,lang),
    isc.get_phonetic_feature_vector(c3,lang)
    ))


Similarity between क and ख
0.8333319444467593

Similarity between क and भ
0.4999991666680556


_You may have figured out that you can also compute similarities of characters belonging to different scripts._

You can also get a similarity matrix which contains the similarities between all pairs of characters (within the same script or across scripts).

Let's see how we can compare the characters across Devanagari and Malayalam scripts

In [None]:
from indicnlp.script import  indic_scripts as isc
from indicnlp.script import  phonetic_sim as psim


slang='hi'
tlang='ml'
sim_mat=psim.create_similarity_matrix(psim.cosine,slang,tlang,normalize=False)

c1='क'
c2='ഖ'
print('Similarity between {} and {}'.format(c1,c2))
print(sim_mat[isc.get_offset(c1,slang),isc.get_offset(c2,tlang)])

Similarity between क and ഖ
0.8333319444467593


Some similarity functions, such as `sim1`, do not generate values in the range [0,1], and it may be more convenient to have the similarity values in that range. This can be achieved by setting the `normalize` parameter to `True`:

In [None]:
slang='hi'
tlang='ml'
sim_mat=psim.create_similarity_matrix(psim.sim1,slang,tlang,normalize=True)

c1='क'
c2='ഖ'
print(u'Similarity between {} and {}'.format(c1,c2))
print(sim_mat[isc.get_offset(c1,slang),isc.get_offset(c2,tlang)])

Similarity between क and ഖ
0.06860894001932027


### Lexical Similarity

In [None]:
from indicnlp.script import  indic_scripts as isc
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

lang1_str='पिछले दिनों हम लोगों ने कई उत्सव मनाये. कल, हिन्दुस्तान भर में श्री कृष्ण जन्म-महोत्सव मनाया गया.'
lang2_str='વીતેલા દિવસોમાં આપણે કેટલાય ઉત્સવો ઉજવ્યા. હજી ગઇકાલે જ પૂરા હિંદુસ્તાનમાં શ્રીકૃષ્ણ જન્મોત્સવ ઉજવવામાં આવ્યો.'
lang1='hi'
lang2='gu'

lcsr, len1, len2 = isc.lcsr_indic(lang1_str,lang2_str,lang1,lang2)

print('{} string: {}'.format(lang1, lang1_str))
print('{} string: {}'.format(lang2, UnicodeIndicTransliterator.transliterate(lang2_str,lang2,lang1)))
print('Both strings are shown in Devanagari script using script conversion for readability.')
print('LCSR: {}'.format(lcsr))


hi string: पिछले दिनों हम लोगों ने कई उत्सव मनाये. कल, हिन्दुस्तान भर में श्री कृष्ण जन्म-महोत्सव मनाया गया.
gu string: वीतेला दिवसोमां आपणे केटलाय उत्सवो उजव्या. हजी गइकाले ज पूरा हिंदुस्तानमां श्रीकृष्ण जन्मोत्सव उजववामां आव्यो.
Both strings are shown in Devanagari script using script conversion for readability.
LCSR: 0.5545454545454546
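LCSR is commonly read as the longest common subsequence ratio: the length of the longest common subsequence of two strings divided by the length of the longer string. The sketch below implements that common definition for ordinary Python strings. It is an assumption that `lcsr_indic` follows the same formula; in addition, the library first brings both strings into a common script, which this example does not do.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcsr(a, b):
    """Longest common subsequence ratio (assumed definition, see above)."""
    if not a and not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

ratio = lcsr("abcd", "abed")
print(ratio)
```

A ratio close to 1 means the two strings share most of their characters in the same order, which is why related languages written in a common script tend to score high.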


## Orthographic Syllabification

_Orthographic Syllabification_ is an approximate syllabification process for Indic scripts, where CV+ units are defined to be _orthographic syllables_.

See the following paper for details:

Anoop Kunchukuttan, Pushpak Bhattacharyya. [_Orthographic Syllable as basic unit for SMT between Related Languages_](https://arxiv.org/abs/1610.00634). Conference on Empirical Methods in Natural Language Processing **(EMNLP 2016)**. 2016.

In [None]:
from indicnlp.syllable import  syllabifier

w='जगदीशचंद्र'
lang='hi'

print(' '.join(syllabifier.orthographic_syllabify(w,lang)))

ज ग दी श च ंद्र


## Word Segmentation

Unsupervised morphological analyzers for various Indian languages. Given a word, the analyzer returns its component morphemes.
The analyzer can recognize inflectional and derivational morphemes.

The following languages are supported:

_Hindi, Punjabi, Marathi, Konkani, Gujarati, Bengali, Kannada, Tamil, Telugu, Malayalam_

Support for more languages will be added soon.

In [None]:
from indicnlp.morph import unsupervised_morph
from indicnlp import common

analyzer=unsupervised_morph.UnsupervisedMorphAnalyzer('mr')

In [None]:
indic_string='आपल्या हिरड्यांच्या आणि दातांच्यामध्ये जीवाणू असतात .'

analyzes_tokens=analyzer.morph_analyze_document(indic_string.split(' '))

for w in analyzes_tokens:
    print(w)

आपल्या
हिरड्या
ंच्या
आणि
दाता
ंच्या
मध्ये
जीवाणू
असतात
.


### Acronyms

Acronyms have a different behaviour while transliterating. Hence, a rule-based transliterator for transliterating English acronyms to Indian languages is available.

This can also be used to generate synthetic transliteration data to train an Indian language to English transliterator for acronyms.

In [None]:
from indicnlp.transliterate import acronym_transliterator

ack_transliterator=acronym_transliterator.LatinToIndicAcronymTransliterator()
ack_transliterator.transliterate('ICICI',lang='hi')

'आईसीआईसीआई'

# Regular Expression (RE)


A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern.
The material in this section is taken from [py4e](https://www.py4e.com/html3/11-regex), [w3schools](https://www.w3schools.com/python/python_regex.asp) and [w3resource](https://www.w3resource.com/python-exercises/re/).

The regular expression library `re` must be imported into your program before you can use it.

In [None]:
import re

**Here are some of those special characters and character sequences:**

`^` Matches the beginning of the line.

`$` Matches the end of the line.

`.` Matches any character except a newline (a wildcard).

`\s` Matches a whitespace character.

`\S` Matches a non-whitespace character (opposite of \s).

`*` Applies to the immediately preceding character(s) and indicates to match zero or more times.

`*?` Applies to the immediately preceding character(s) and indicates to match zero or more times in “non-greedy mode”.

`+` Applies to the immediately preceding character(s) and indicates to match one or more times.

`|` Matches either the expression before it or the expression after it.

`+?` Applies to the immediately preceding character(s) and indicates to match one or more times in “non-greedy mode”.

`?` Applies to the immediately preceding character(s) and indicates to match zero or one time.

`??` Applies to the immediately preceding character(s) and indicates to match zero or one time in “non-greedy mode”.

`[aeiou]` Matches a single character as long as that character is in the specified set. In this example, it would match “a”, “e”, “i”, “o”, or “u”, but no other characters.

`[a-z0-9]` You can specify ranges of characters using the minus sign. This example is a single character that must be a lowercase letter or a digit.

`[^A-Za-z]` When the first character in the set notation is a caret, it inverts the logic. This example matches a single character that is anything other than an uppercase or lowercase letter.

`( )` When parentheses are added to a regular expression, they are ignored for the purpose of matching, but allow you to extract a particular subset of the matched string rather than the whole string when using findall().

`\b` Matches the empty string, but only at the start or end of a word.

`\B` Matches the empty string, but not at the start or end of a word.

`\d` Matches any decimal digit; equivalent to the set [0-9].

`\D` Matches any non-digit character; equivalent to the set [^0-9].
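A few of these sequences are easier to understand side by side. The sketch below uses a made-up sample line (the addresses are placeholders) to contrast greedy and non-greedy matching, and shows how parentheses change what `findall()` returns.

```python
import re

line = "From: <alice@example.com> to <bob@example.org>"

# Greedy: .+ grabs as much as possible, up to the LAST '>' on the line.
greedy = re.findall(r"<.+>", line)

# Non-greedy: .+? stops at the FIRST '>' it can, giving one match per address.
lazy = re.findall(r"<.+?>", line)

# Parentheses: the whole pattern must match, but only the group is returned.
users = re.findall(r"<(\S+)@", line)

# Character classes: \d+ matches runs of digits, [^A-Za-z] a single non-letter.
numbers = re.findall(r"\d+", "Room 12, floor 3")
non_letters = re.findall(r"[^A-Za-z]", "a-b c")

print(greedy)
print(lazy)
print(users)
print(numbers)
print(non_letters)
```

Writing patterns as raw strings (`r"..."`) keeps Python from interpreting the backslashes before the `re` module sees them.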

## RegEx Functions
The `re` module offers a set of functions that allows us to search a string for a match:

Function  | Description
------------- | -------------
findall  | Returns a list containing all matches
search  | Returns a Match object if there is a match anywhere in the string
split  | Returns a list where the string has been split at each match
sub  | Replaces one or many matches with a string


### The findall() Function
The `findall()` function returns a list containing all matches.

In [None]:
txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

['ai', 'ai']


The list contains the matches in the order they are found.

If no matches are found, an empty list is returned:

In [None]:
# Return an empty list if no match was found:
txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)

[]


### The search() Function
The `search()` function searches the string for a match, and returns a Match object if there is a match.

If there is more than one match, only the first occurrence of the match will be returned:

In [None]:
#Search for the first white-space character in the string:

txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 3


If no matches are found, the value `None` is returned:

In [None]:
# Make a search that returns no match:
txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)

None


### The split() function
The `split()` function returns a list where the string has been split at each match:



In [None]:
# Split at each white-space character:

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']


You can control the number of occurrences by specifying the maxsplit parameter:

In [None]:
txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)

['The', 'rain in Spain']


### The sub() Function
The `sub()` function replaces the matches with the text of your choice:

In [None]:
# Replace every white-space character with the number 9:

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

The9rain9in9Spain


You can control the number of replacements by specifying the `count` parameter:

In [None]:
#Replace the first 2 occurrences:
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)

The9rain9in Spain


### Match Object
A Match Object is an object containing information about the search and the result.

Note: If there is no match, the value `None` will be returned, instead of the Match Object.

In [None]:
# Do a search that will return a Match Object:
txt = "The rain in Spain"
x = re.search("ai", txt)
print(x) #this will print an object

<re.Match object; span=(5, 7), match='ai'>


The Match object has properties and methods used to retrieve information about the search, and the result:

`.span()` returns a tuple containing the start and end positions of the match.

`.string` returns the string passed into the function.

`.group()` returns the part of the string where there was a match.

**Example of span() Function:**

In [None]:
# Print the start and end position of the first match.

# The regular expression looks for any word that starts with an upper case "S":

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())

(12, 17)


**Example of the string attribute:**

In [None]:
# Print the string passed into the function:
txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.string)

The rain in Spain


**Example of group() Function:**

In [None]:
# Print the part of the string where there was a match.

# The regular expression looks for any word that starts with an upper case "S":

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.group())

Spain
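
When the pattern contains parentheses, `.group()` also accepts a group number, and `.groups()` returns all captured parts at once. The sketch below reuses the same sentence; splitting the word into two groups is an illustrative choice, not something the examples above require.

```python
import re

txt = "The rain in Spain"

# Two capture groups: the leading "S" and the rest of the word
m = re.search(r"\b(S)(\w+)", txt)

if m is not None:
    print(m.group(0))          # whole match, same as m.group()
    print(m.group(1))          # first capture group
    print(m.group(2))          # second capture group
    print(m.groups())          # all capture groups as a tuple
    print(m.start(), m.end())  # the same two numbers that m.span() returns
```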


### Remove all whitespace from a string

In [None]:
text1 = ' Regular   Expression '
print("Original string:",text1)
print("Without whitespace:",re.sub(r'\s+', '',text1))

Original string:  Regular   Expression 
Without whitespace: RegularExpression


In [None]:
# TODO: Remove all whitespace from a string.

text2 = '       Regular   Expression '

### Remove multiple spaces in a string

In [None]:
text1 = '        Regular   Expression '
print("Original string:",text1)
print("Without extra spaces:",re.sub(' +',' ',text1))

Original string:         Regular   Expression 
Without extra spaces:  Regular Expression 
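
The result above still has one leading and one trailing space, because each run of spaces is collapsed to a single space rather than removed. A possible follow-up, sketched here as one option among several, is to collapse any kind of whitespace and then strip both ends with the string method `strip()`.

```python
import re

text1 = '        Regular   Expression '

# Collapse every run of whitespace (spaces, tabs, newlines) into one space,
# then remove whatever is left at both ends
normalized = re.sub(r'\s+', ' ', text1).strip()

print(repr(normalized))
```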


## Extracting data using regular expressions

### Extract values between quotation marks of a string

In [None]:
text1 = '"Natural", "Language", "Processing"'
print(re.findall(r'"(.*?)"', text1))

['Natural', 'Language', 'Processing']
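
The `?` after `.*` makes the quantifier non-greedy, so each match stops at the nearest closing quotation mark. The sketch below compares it with the greedy form on the same input; the variable names `lazy` and `greedy` are only for illustration.

```python
import re

text1 = '"Natural", "Language", "Processing"'

# Non-greedy: match as little as possible between two quotation marks
lazy = re.findall(r'"(.*?)"', text1)

# Greedy: match as much as possible, so the match runs from the first
# quotation mark to the last one in the string
greedy = re.findall(r'"(.*)"', text1)

print(lazy)
print(greedy)
```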


### Escape character

Since we use special characters in regular expressions to match the beginning or end of a line or specify wild cards, we need a way to indicate that these characters are “normal” and we want to match the actual character such as a dollar sign or caret.

We can indicate that we want to simply match a character by prefixing that character with a backslash. For example, we can find money amounts with the following regular expression.

In [None]:
x = 'We just received $10.00 for cookies.'
y = re.findall(r'\$[0-9.]+',x)
print(y)

['$10.00']


Since we prefix the dollar sign with a backslash, it actually matches the dollar sign in the input string instead of matching the “end of line”, and the rest of the regular expression matches one or more digits or the period character.

Note: Inside square brackets, most characters are not “special”. So when we say `[0-9.]`, it really means a digit or a period. Outside of square brackets, a period is the “wild-card” character and matches any character except a newline. Inside square brackets, the period is a period. The few exceptions are `^` at the start of the set, `-` between two characters, `]`, and the backslash.
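
The difference between a bare period, a period inside square brackets, and an escaped period can be checked directly. This is a small sketch on a made-up string.

```python
import re

x = 'Version 1.5 or 1x5'  # hypothetical sample text

# Bare period: any single character may sit between 1 and 5
wildcard = re.findall(r'1.5', x)

# Period inside square brackets: only a literal period matches
in_brackets = re.findall(r'1[.]5', x)

# Escaped period: also only a literal period
escaped = re.findall(r'1\.5', x)

print(wildcard)
print(in_brackets)
print(escaped)
```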

### Extract domain ('gmail.com') from the Email Address.

In [None]:
sentence = "From nlp.course.iitk@gmail.com Sat September  5 09:14:16 2020"
domain = re.findall(r"@(\S+)", sentence)
print(domain)

['gmail.com']


### Extract the Email address

In [None]:
sentence = "From nlp.course.iitk@gmail.com Sat September  5 09:14:16 2020"
y = re.findall(r"\S+?@\S+", sentence)
print(y)

['nlp.course.iitk@gmail.com']


### Remove leading zeros from an IP address

In [None]:
ip = "216.08.094.196"
# Drop zeros at the start of each octet, but keep an octet that is just "0"
string = re.sub(r'\b0+(\d)', r'\1', ip)
print(string)

216.8.94.196


### Remove lowercase substrings from a given string

In [None]:
str1 = 'CS698V/CS779: Statistical NATURAL LANGUAGE PROCESSING (NLP). Parsing: CFG, Lexicalized CFG, PCFGs, Dependency parsing'
print("Original string:")
print(str1)
print("After removing lowercase letters, above string becomes:")
remove_lower = lambda text: re.sub('[a-z]', '', text)
result =  remove_lower(str1)
print(result)

Original string:
CS698V/CS779: Statistical NATURAL LANGUAGE PROCESSING (NLP). Parsing: CFG, Lexicalized CFG, PCFGs, Dependency parsing
After removing lowercase letters, above string becomes:
CS698V/CS779: S NATURAL LANGUAGE PROCESSING (NLP). P: CFG, L CFG, PCFG, D 


### Insert spaces between words starting with capital letters

In [None]:
str1 = "NaturalLanguageProcessing"
string = re.sub(r"(\w)([A-Z])", r"\1 \2", str1)
print(string)

Natural Language Processing
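
The pattern above also splits between two consecutive capital letters, which may not be wanted when the text contains acronyms. One alternative, sketched here under the assumption that a word break should only occur where a lowercase letter is followed by an uppercase letter, uses lookarounds. The helper name `split_camel_case` is hypothetical.

```python
import re

def split_camel_case(text):
    # Insert a space at every position that has a lowercase letter
    # on its left and an uppercase letter on its right
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)

print(split_camel_case("NaturalLanguageProcessing"))
print(split_camel_case("StatisticalNLP"))

# For comparison, the capture-group pattern splits inside the acronym
print(re.sub(r"(\w)([A-Z])", r"\1 \2", "StatisticalNLP"))
```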


### Remove parenthesized text from a string

In [None]:
items = ["Natural Language Processing (NLP)", "IITK", "gmail (.com)", "CS (779)"]

for item in items:
    print(re.sub(r" ?\([^)]+\)", "", item))

Natural Language Processing
IITK
gmail
CS


### Remove the ANSI escape sequences from a string

In [None]:
text = "\t\u001b[0;35mCS779\u001b[0m \u001b[0;36mStatistical Natural Language Processing\u001b[0m"
print("Original Text: ",text)
reaesc = re.compile(r'\x1b[^m]*m')
new_text = reaesc.sub('', text)
print("New Text: ",new_text)

Original Text:  	[0;35mCS779[0m [0;36mStatistical Natural Language Processing[0m
New Text:  	CS779 Statistical Natural Language Processing


## Download a sample text file which records mail activity from various individuals in an open source project development team.

In [None]:
!wget "http://www.py4inf.com/code/mbox-short.txt"

--2025-02-03 18:28:47--  http://www.py4inf.com/code/mbox-short.txt
Resolving www.py4inf.com (www.py4inf.com)... 74.208.236.248, 2607:f1c0:100f:f000::2df
Connecting to www.py4inf.com (www.py4inf.com)|74.208.236.248|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.dr-chuck.com/py4inf/code/mbox-short.txt [following]
--2025-02-03 18:28:47--  https://www.dr-chuck.com/py4inf/code/mbox-short.txt
Resolving www.dr-chuck.com (www.dr-chuck.com)... 74.208.236.248, 2607:f1c0:100f:f000::2df
Connecting to www.dr-chuck.com (www.dr-chuck.com)|74.208.236.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 94626 (92K) [text/plain]
Saving to: ‘mbox-short.txt’


2025-02-03 18:28:48 (1.72 MB/s) - ‘mbox-short.txt’ saved [94626/94626]



In [None]:
hand = open('/content/mbox-short.txt')

### Search for lines that start with From: and contain an at sign

In [None]:
for line in hand:
    line = line.rstrip()
    if re.search('^From:.+@', line):
        print(line)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
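
The same line filter can be combined with a capture group to pull out just the address and count how often each sender appears. The sketch below is one possible approach; it uses a few made-up lines in the style of the mail file so that it runs without the download, and the names `sample_lines` and `senders` are illustrative.

```python
import re
from collections import Counter

# Hypothetical lines in the style of mbox-short.txt
sample_lines = [
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n",
    "From: stephen.marquard@uct.ac.za\n",
    "Subject: weekly status\n",
    "From: louis@media.berkeley.edu\n",
    "From: stephen.marquard@uct.ac.za\n",
]

senders = Counter()
for line in sample_lines:
    line = line.rstrip()
    # The parentheses capture only the address that follows "From: "
    match = re.search(r'^From: (\S+@\S+)', line)
    if match:
        senders[match.group(1)] += 1

print(senders.most_common())
```

When working with the real file, keep in mind that a file object can be iterated only once. Reopen the file, or call `hand.seek(0)`, before looping over it a second time.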


In [None]:
# TODO:
# Search for lines that start with 'X' followed by any
# non-whitespace characters and ':'
# followed by a space and any number.
# The number can include a decimal.

In [None]:
# TODO:
# Search for lines that start with 'X' followed by any
# non-whitespace characters and ':' followed by a space
# and any number. The number can include a decimal.
# Then print the number if it is greater than zero.

In [None]:
# TODO:
# Search for lines that start with 'Details: rev='
# followed by numbers and '.'
# Then print the number if it is greater than zero

In [None]:
# TODO:
# Search for lines that start with From and a character
# followed by a two digit number between 00 and 99 followed by ':'
# Then print the number if it is greater than zero