# Work Package 1: Starting with language data

## Tokenization

To split text into sentences, we can import NLTK's sentence tokenization function, sent_tokenize(), whose argument is the text to be tokenized. The sent_tokenize() function uses an instance of PunktSentenceTokenizer. NLTK ships this tokenizer pre-trained for several European languages; it detects sentence boundaries from the letters and punctuation that mark the beginning and end of sentences. In the cells below, we load the pre-trained English model directly.

In [1]:
import nltk

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
text=" Hello everyone. Hope all are fine and doing well. Hope you find the book interesting"
tokenizer.tokenize(text)

[' Hello everyone.',
 'Hope all are fine and doing well.',
 'Hope you find the book interesting']
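To see why a trained sentence tokenizer is useful, it can help to compare it with a purely punctuation-based split. The following is a minimal sketch using only the standard re module; the helper name naive_sent_split is hypothetical and is not part of NLTK, and the rule it applies (split after ., ! or ? followed by whitespace) is a simplification, not how Punkt works.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (a crude heuristic)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# The example text from the cell above splits as expected.
simple = naive_sent_split(
    " Hello everyone. Hope all are fine and doing well. "
    "Hope you find the book interesting"
)
print(simple)

# An abbreviation fools the heuristic: "Dr." is treated as a sentence end.
tricky = naive_sent_split("Dr. Smith arrived. He sat down.")
print(tricky)  # ['Dr.', 'Smith arrived.', 'He sat down.']
```

Abbreviations such as "Dr." are the kind of case that a trained model is meant to handle better than a fixed rule.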

#### Tokenization of text in other languages:

To tokenize text in a language other than English, we load the corresponding language
pickle file from tokenizers/punkt and pass the text to the tokenize() function. For French
text, we use the french.pickle file as follows:

In [4]:
french_tokenizer=nltk.data.load('tokenizers/punkt/french.pickle')
french_tokenizer.tokenize("Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage collège franco-britanniquedeLevallois-Perret. Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage Levallois. L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, janvier , d'un professeur d'histoire. L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, mercredi , d'un professeur d'histoire")

['Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage collège franco-britanniquedeLevallois-Perret.',
 'Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage Levallois.',
 "L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, janvier , d'un professeur d'histoire.",
 "L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, mercredi , d'un professeur d'histoire"]

#### Tokenization of sentences into words:

Now, we'll process individual sentences by tokenizing them into words. Word tokenization
is performed using the word_tokenize() function, which applies a Penn Treebank-style word
tokenizer. Here, we first use TreebankWordTokenizer directly. Because this tokenizer assumes
that its input is a single sentence, only a sentence-final period is split off, which is why
day. stays as one token in the output below.

In [7]:
from nltk.tokenize import TreebankWordTokenizer

In [8]:
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize("Have a nice day. I hope you find the book interesting")

['Have',
 'a',
 'nice',
 'day.',
 'I',
 'hope',
 'you',
 'find',
 'the',
 'book',
 'interesting']

TreebankWordTokenizer follows the conventions of the Penn Treebank corpus. In particular, it
splits contractions, as shown here with word_tokenize():

In [9]:
text=nltk.word_tokenize(" Don't hesitate to ask questions")
print(text)

['Do', "n't", 'hesitate', 'to', 'ask', 'questions']


Another word tokenizer is PunktWordTokenizer, which appears in older NLTK releases and may not be importable in current ones. It splits on punctuation but keeps the punctuation attached to the word instead of creating a separate token. A further word tokenizer is WordPunctTokenizer, which
makes each run of punctuation a separate token. This type of splitting is usually
desirable:

In [11]:
from nltk.tokenize import WordPunctTokenizer

In [12]:
tokenizer=WordPunctTokenizer()
tokenizer.tokenize(" Don't hesitate to ask questions")

['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions']

#### Tokenization using regular expressions (regex)

Word tokenization can be performed with regular expressions in two ways:

- By matching the words themselves
- By matching the spaces or gaps between words

We can import RegexpTokenizer from NLTK and create a regular expression that matches
the tokens present in the text:

In [13]:
from nltk.tokenize import RegexpTokenizer

In [14]:
tokenizer=RegexpTokenizer(r"[\w]+")  # raw string avoids invalid-escape warnings
tokenizer.tokenize("Don't hesitate to ask questions")

['Don', 't', 'hesitate', 'to', 'ask', 'questions']

Instead of instantiating the class, we can also tokenize with the regexp_tokenize() function:

In [15]:
from nltk.tokenize import regexp_tokenize

In [16]:
sent="Don't hesitate to ask questions"
print(regexp_tokenize(sent, pattern=r'\w+|\$[\d\.]+|\S+'))  # raw string for the regex

['Don', "'t", 'hesitate', 'to', 'ask', 'questions']
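The cells above show only the first approach (matching the words). As a minimal sketch of the difference between the two approaches, the following uses the standard re module rather than NLTK, so it runs without any NLTK data. In NLTK itself, RegexpTokenizer accepts a gaps=True argument to treat the pattern as a separator instead of a token.

```python
import re

sentence = "Don't hesitate to ask questions"

# Approach 1: the pattern describes the tokens themselves.
word_tokens = re.findall(r"\w+", sentence)

# Approach 2: the pattern describes the gaps (whitespace) between tokens.
gap_tokens = re.split(r"\s+", sentence.strip())

print(word_tokens)  # ['Don', 't', 'hesitate', 'to', 'ask', 'questions']
print(gap_tokens)   # ["Don't", 'hesitate', 'to', 'ask', 'questions']
```

Matching gaps keeps Don't intact, whereas matching \w+ drops the apostrophe and splits the contraction.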


#### Conversion into lowercase and uppercase:

In [17]:
text='HARdWork IS KEy to SUCCESS'
print(text.lower())
print(text.upper())

hardwork is key to success
HARDWORK IS KEY TO SUCCESS


#### Dealing with stop words:

NLTK has a list of stop words for many languages. We need to download and unzip the stopwords
data file so that the lists can be accessed from nltk_data/corpora/stopwords/:

In [20]:
from nltk.corpus import stopwords

In [22]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [23]:
stops=set(stopwords.words('english'))
words=["Don't", 'hesitate','to','ask','questions']
[word for word in words if word not in stops]

["Don't", 'hesitate', 'ask', 'questions']
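Note that "Don't" survives the filtering above: the membership test is case-sensitive, and the NLTK English list stores its entries in lowercase. A minimal sketch of a case-insensitive filter follows. The STOPS set here is a small, hand-written stand-in used only so that the example runs without NLTK data, and the helper name remove_stop_words is hypothetical; in the notebook, set(stopwords.words('english')) would take the place of STOPS.

```python
# Hypothetical, hand-written stop word set used only for illustration;
# in the notebook this would be set(stopwords.words('english')).
STOPS = {"don't", "to", "the", "a", "is"}

def remove_stop_words(words, stops):
    """Drop words whose lowercase form appears in `stops`."""
    return [word for word in words if word.lower() not in stops]

words = ["Don't", "hesitate", "to", "ask", "questions"]
filtered = remove_stop_words(words, STOPS)
print(filtered)  # ['hesitate', 'ask', 'questions']
```

Lowercasing only for the comparison keeps the original capitalization of the words that remain.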

The stopwords corpus is an instance of nltk.corpus.reader.WordListCorpusReader. Its words()
function takes a fileid argument; here it is english, which selects all the stop words in the
English file. If words() is called with no argument, it returns the stop words of all
the languages. The languages for which NLTK provides a stop word file can be listed using
the fileids() function:

In [24]:
stopwords.fileids()

['arabic',
 'azerbaijani',
 'basque',
 'bengali',
 'catalan',
 'chinese',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hebrew',
 'hinglish',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

## Lemmatization

Lemmatization is the process of reducing a word to its dictionary form, or lemma. Unlike a
stem, a lemma is itself a valid word. WordNetLemmatizer performs lemmatization using WordNet's
built-in morphy() function. The input word is returned unchanged if it is not
found in WordNet. The pos argument gives the part-of-speech category of the input word; it
defaults to noun ('n'), which is why working is returned unchanged in the example below while
works becomes work. Consider an example of lemmatization in NLTK:

In [29]:
from nltk.stem import WordNetLemmatizer

In [32]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [33]:
lemmatizer_output=WordNetLemmatizer()
lemmatizer_output.lemmatize('working')

'working'

In [34]:
lemmatizer_output.lemmatize('works')

'work'

The WordNetLemmatizer class is a wrapper around the WordNet corpus. It uses the morphy()
function of WordNetCorpusReader to extract a lemma. If no lemma is found, the word is
returned in its original form. For example, for works, the lemma returned is the singular
form, work.

The following code illustrates the difference between stemming and lemmatization:

In [35]:
from nltk.stem import PorterStemmer

In [38]:
stemmer_output=PorterStemmer()
stemmer_output.stem('happiness')

'happi'

In [40]:
from nltk.stem import WordNetLemmatizer

In [41]:
lemmatizer_output=WordNetLemmatizer()
lemmatizer_output.lemmatize('happiness')

'happiness'

In the preceding code, stemming converts happiness to happi, which is not a dictionary word.
The lemmatizer leaves happiness unchanged, because the word is already in its dictionary
form.

## Similarity measures

In [42]:
from nltk.metrics import edit_distance, jaccard_distance  # explicit names instead of *

In [43]:
edit_distance("relate","relation")

3

In [44]:
edit_distance("suggestion","calculation")

7
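With its default arguments, edit_distance() computes the Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The following is a minimal pure-Python sketch of the standard dynamic-programming approach, given only to make the numbers above easier to follow. The function name levenshtein is hypothetical, and this is not NLTK's own implementation.

```python
def levenshtein(source, target):
    """Return the Levenshtein distance, computed row by row."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into an empty prefix costs i
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]

print(levenshtein("relate", "relation"))         # 3
print(levenshtein("suggestion", "calculation"))  # 7
```

For relate and relation, the three edits are one substitution (e to i) and two insertions (o and n).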

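Another similarity measure is Jaccard's coefficient. Jaccard's coefficient, or the Tanimoto
coefficient, measures the overlap of two sets, X and Y.

It is defined as follows:

- Jaccard(X, Y) = |X ∩ Y| / |X ∪ Y|
- Jaccard(X, X) = 1
- Jaccard(X, Y) = 0 if X ∩ Y = ∅

Note that NLTK's jaccard_distance() function returns the corresponding distance, 1 - Jaccard(X, Y), so identical sets give 0 and disjoint sets give 1. In the following example, the coefficient is 2/5 = 0.4, so the distance is 0.6: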

In [45]:
X=set([10,20,30,40])
Y=set([20,30,60])
print(jaccard_distance(X,Y))

0.6
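The value printed above is a distance, not the coefficient itself. As a minimal sketch with plain Python sets, the following computes both quantities for the same X and Y. The function name jaccard_coefficient is hypothetical, and returning 1.0 for two empty sets is an assumption made here to avoid a division by zero.

```python
def jaccard_coefficient(x, y):
    """Return |X ∩ Y| / |X ∪ Y| for two sets."""
    union = x | y
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(x & y) / len(union)

X = {10, 20, 30, 40}
Y = {20, 30, 60}

coefficient = jaccard_coefficient(X, Y)  # 2 shared items out of 5 distinct
distance = 1 - coefficient

print(coefficient)  # 0.4
print(distance)     # 0.6
```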
