Tokenization

Text into sentences

In [2]:
import nltk


In [3]:
nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rosel\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

Tokenization of text into sentences:

In [4]:
tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')

In [5]:
text=" Hello everyone. Hope all are fine and doing well. Hope you find the book interesting"

In [6]:
tokenizer.tokenize(text)

[' Hello everyone.',
 'Hope all are fine and doing well.',
 'Hope you find the book interesting']

Tokenization of text in other languages:


In [7]:
french_tokenizer=nltk.data.load('tokenizers/punkt/french.pickle')


In [9]:
french_tokenizer.tokenize("Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage collège franco-britannique de Levallois-Perret. Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage Levallois. L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, janvier , d'un professeur d'histoire. L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, mercredi , d'un professeur d'histoire")


['Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage collège franco-britannique de Levallois-Perret.',
 'Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage Levallois.',
 "L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, janvier , d'un professeur d'histoire.",
 "L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, mercredi , d'un professeur d'histoire"]

Tokenization of sentences into words:

In [11]:
from nltk.tokenize import TreebankWordTokenizer

In [12]:
tokenizer = TreebankWordTokenizer()


In [13]:
tokenizer.tokenize("Have a nice day. I hope you find the book interesting")

['Have',
 'a',
 'nice',
 'day.',
 'I',
 'hope',
 'you',
 'find',
 'the',
 'book',
 'interesting']

TreebankWordTokenizer follows the conventions of the Penn Treebank corpus. Among other things, 
it splits contractions into separate tokens. This is shown here with nltk.word_tokenize, which 
builds on it:

In [15]:
text=nltk.word_tokenize(" Don't hesitate to ask questions")
print(text)

['Do', "n't", 'hesitate', 'to', 'ask', 'questions']
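
The contraction split seen in this output can be imitated with a single regular-expression rule. The following is only a minimal sketch using Python's standard re module, not NLTK's actual implementation; the function name split_negative_contractions is hypothetical, and the real Treebank tokenizer applies many more rules.

```python
import re

def split_negative_contractions(text):
    """Separate a trailing n't from its verb, Penn Treebank style (illustrative only)."""
    # "Don't" -> "Do n't": capture the stem and the n't suffix, then put a space between them
    spaced = re.sub(r"(\w+)(n't)\b", r"\1 \2", text)
    # A whitespace split is enough for this punctuation-free example
    return spaced.split()

tokens = split_negative_contractions("Don't hesitate to ask questions")
print(tokens)
```

The stem keeps its original capitalization because only a space is inserted; nothing is rewritten.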


word_tokenize(text, language="english", preserve_line=False) returns a tokenized copy of text, 
using NLTK's recommended word tokenizer (currently an improved TreebankWordTokenizer along with 
PunktSentenceTokenizer for the specified language).


Another word tokenizer is PunktWordTokenizer. It splits on punctuation but keeps the punctuation 
attached to the word instead of creating a separate token. Note that it belongs to older NLTK 
releases and may not be importable in recent versions. WordPunctTokenizer, by contrast, splits 
punctuation off into separate tokens. This type of splitting is usually desirable:


In [16]:
from nltk.tokenize import WordPunctTokenizer

In [17]:
tokenizer=WordPunctTokenizer()
tokenizer.tokenize(" Don't hesitate to ask questions")

['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions']
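
WordPunctTokenizer's behavior corresponds to the regular expression \w+|[^\w\s]+, that is, runs of word characters or runs of punctuation. As a sketch, the same split can be reproduced with the standard re module alone; the helper name word_punct_tokenize is an assumption made for this illustration.

```python
import re

def word_punct_tokenize(text):
    # Alternation: a run of word characters, or a run of non-word, non-space characters
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = word_punct_tokenize(" Don't hesitate to ask questions")
print(tokens)
```

Leading whitespace is simply skipped, since neither alternative of the pattern matches a space.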

Tokenization using regular expressions (regex)


The tokenization of words can be performed by constructing regular expressions in these two ways:

• By matching with words

• By matching spaces or gaps

We can import RegexpTokenizer from NLTK and construct it with a regular expression that matches 
the tokens present in the text:

In [18]:
from nltk.tokenize import RegexpTokenizer

In [23]:
tokenizer=RegexpTokenizer(r"[\w]+")


In [24]:
tokenizer.tokenize("Don't hesitate to ask questions")

['Don', 't', 'hesitate', 'to', 'ask', 'questions']
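
The two approaches listed earlier, matching the tokens and matching the gaps between them, can be compared side by side. This sketch uses only the standard re module; in NLTK the gap-based variant is selected by passing gaps=True to RegexpTokenizer.

```python
import re

sent = "Don't hesitate to ask questions"

# Way 1: the pattern describes the tokens themselves
by_tokens = re.findall(r"\w+", sent)

# Way 2: the pattern describes the gaps (whitespace) between tokens
by_gaps = [tok for tok in re.split(r"\s+", sent) if tok]

print(by_tokens)
print(by_gaps)
```

Matching gaps keeps the apostrophe inside Don't, whereas matching \w+ discards it and splits the word in two.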

Instead of instantiating the class, we can also tokenize by calling the regexp_tokenize function:

In [25]:
from nltk.tokenize import regexp_tokenize

In [28]:
sent="Don't hesitate to ask questions"
print(regexp_tokenize(sent, pattern=r'\w+|\$[\d\.]+|\S+'))


['Don', "'t", 'hesitate', 'to', 'ask', 'questions']


Conversion into lowercase and uppercase:

In [31]:
text='HARdWork IS KEy to SUCCESS'
print(text.lower())

hardwork is key to success


In [32]:
print(text.upper())

HARDWORK IS KEY TO SUCCESS


Dealing with stop words:

NLTK provides stop word lists for many languages. The stopwords package has to be downloaded 
and unzipped so that the lists can be read from nltk_data/corpora/stopwords/ 

In [33]:
from nltk.corpus import stopwords

In [36]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rosel\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [37]:
stops=set(stopwords.words('english'))

In [42]:
words=["Don't", 'hesitate','to','ask','questions']
w1 = [word for word in words if word not in stops]
print(w1)

["Don't", 'hesitate', 'ask', 'questions']
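
In the output above, Don't survives the filter, most likely because the membership test is case-sensitive while the entries in the stop word list are lowercase. One possible fix is to lowercase each word before the lookup. The sketch below uses a small hand-written stop set as a stand-in (an assumption, so that the example runs without downloading the corpus); with NLTK it would be set(stopwords.words('english')).

```python
# Hypothetical stand-in for set(stopwords.words('english')); entries are lowercase
stops = {"don't", "to", "the", "a", "is"}

words = ["Don't", "hesitate", "to", "ask", "questions"]

# Compare the lowercased form, but keep the original spelling of the words we retain
filtered = [word for word in words if word.lower() not in stops]
print(filtered)
```

Lowercasing only inside the test leaves the capitalization of the remaining words untouched.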


The stopwords corpus is an instance of nltk.corpus.reader.WordListCorpusReader. Its words() 
function takes a fileid as its argument. Here, it is 'english', which selects all the stop words 
in the English file. If words() is called with no argument, it returns the stop words of all 
the languages. The languages for which NLTK provides a stop word file can be listed using the 
fileids() function:

In [43]:
stopwords.fileids()

['arabic',
 'azerbaijani',
 'basque',
 'bengali',
 'catalan',
 'chinese',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hebrew',
 'hinglish',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

In [3]:
import re

# (regex, replacement) pairs; matching is case-sensitive
replacement_patterns = [
    (r'don\'t', 'do not'),
    (r'didn\'t', 'did not'),
    (r'can\'t', 'cannot'),
]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        # Compile every pattern once so replace() can reuse them
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        # Apply each substitution in turn to the running result
        s = text
        for (pattern, repl) in self.patterns:
            s = pattern.sub(repl, s)
        return s

replacer = RegexpReplacer()
replacer.replace("Don't hesitate to ask questions.")
print(replacer.replace("She must've gone to the market but she didn't go."))

She must've gone to the market but she did not go.


In [6]:
replacer= RegexpReplacer()

In [7]:
replacer.replace("Don't hesitate to ask questions")

"Don't hesitate to ask questions"

In [9]:
replacer.replace("She must've gone to the market but she didn't go")

"She must've gone to the market but she did not go"

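RegexpReplacer.replace() substitutes every match of a replacement pattern with its corresponding 
substitution. With the three patterns defined in this notebook, didn't is replaced by did not, 
while must've stays as it is because no pattern covers it, and Don't stays as it is because the 
patterns are case-sensitive and only match lowercase don't. The fuller pattern list in 
replacers.py also defines tuple pairs such as (r'(\w+)\'ve', '\g<1> have') and 
(r'(\w+)n\'t', '\g<1> not'), with which must've would become must have.
Replacement is not limited to contractions; any token can be substituted with any other token. 
Because Don't is not replaced here, tokenizing the replaced text gives the same tokens as 
tokenizing the original text.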
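
Generic patterns with a captured group can cover whole families of contractions. The following sketch is a variant of the RegexpReplacer class, not the contents of replacers.py: the class name ContractionReplacer and the exact pattern list are hypothetical, and irregular forms are listed before the generic rules so that they are handled first.

```python
import re

# Irregular forms first, generic rules last (order matters)
CONTRACTION_PATTERNS = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"(\w+)n't", r"\g<1> not"),   # didn't -> did not, Don't -> Do not
    (r"(\w+)'ve", r"\g<1> have"),  # must've -> must have
]

class ContractionReplacer:
    def __init__(self, patterns=CONTRACTION_PATTERNS):
        # Compile once; each entry is (compiled regex, replacement template)
        self.patterns = [(re.compile(regex), repl) for regex, repl in patterns]

    def replace(self, text):
        for pattern, repl in self.patterns:
            text = pattern.sub(repl, text)
        return text

replacer = ContractionReplacer()
expanded_1 = replacer.replace("Don't hesitate to ask questions")
expanded_2 = replacer.replace("She must've gone to the market but she didn't go")
print(expanded_1)
print(expanded_2)
```

Because the generic rules keep the captured stem unchanged, capitalized forms such as Don't are expanded too. Capitalized irregular forms (for example Won't) are not handled by this sketch and would need their own entries.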

In [10]:
from nltk.tokenize import word_tokenize

In [11]:
replacer=RegexpReplacer()


In [12]:
word_tokenize("Don't hesitate to ask questions")

['Do', "n't", 'hesitate', 'to', 'ask', 'questions']

In [13]:
word_tokenize(replacer.replace("Don't hesitate to ask questions"))

['Do', "n't", 'hesitate', 'to', 'ask', 'questions']