# Basics of text processing

### Natural Language Processing and Information Extraction, 2021 WS
Lecture 1, 10/23/2020

Gábor Recski

## In this lecture
- Regular Expressions

- Text segmentation and normalization:
   - sentence splitting and tokenization
   - lemmatization, stemming, decompounding, morphology

## Import dependencies

In [1]:
import re
from collections import Counter

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
import stanza
stanza.download('en')

[nltk_data] Downloading package punkt to /home/recski/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.1.0.json: 122kB [00:00, 28.1MB/s]                    
2021-09-14 11:26:15 INFO: Downloading default packages for language: en (English)...
2021-09-14 11:26:16 INFO: File exists: /home/recski/stanza_resources/en/default.zip.
2021-09-14 11:26:20 INFO: Finished downloading models and saved to /home/recski/stanza_resources.


## Regular expressions

### Basics

![re1](media/re1.png)([SLP Ch.2](https://web.stanford.edu/~jurafsky/slp3/2.pdf))

In [3]:
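# read the whole book; the file contains non-ASCII quotes, so set the encoding explicitly
with open('data/alice.txt', encoding='utf-8') as f:
    text = f.read()
text[:100]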

'\nCHAPTER I.\nDown the Rabbit-Hole\n\n\nAlice was beginning to get very tired of sitting by her sister on'

In [4]:
re.search('Alice', text)

<re.Match object; span=(35, 40), match='Alice'>

In [5]:
re.search('a', text)

<re.Match object; span=(22, 23), match='a'>

In [6]:
text[:41]

'\nCHAPTER I.\nDown the Rabbit-Hole\n\n\nAlice '

![re2](media/re2.png)([SLP Ch.2](https://web.stanford.edu/~jurafsky/slp3/2.pdf))

In [7]:
re.search('[Rr]abbit', text)

<re.Match object; span=(21, 27), match='Rabbit'>

In [8]:
re.findall('[Rr]abbit', text[:5000])

['Rabbit', 'Rabbit', 'Rabbit', 'Rabbit', 'rabbit', 'rabbit', 'rabbit']

In [9]:
for match in re.finditer('[Rr]abbit', text[:5000]):
    print(match.group(), match.span())

Rabbit (21, 27)
Rabbit (589, 595)
Rabbit (743, 749)
Rabbit (959, 965)
rabbit (1149, 1155)
rabbit (1341, 1347)
rabbit (1486, 1492)


![re3](media/re3.png)([SLP Ch.2](https://web.stanford.edu/~jurafsky/slp3/2.pdf))

In [10]:
re.findall(' [A-Za-z][a-z][a-z] ', text[:5000])

[' the ',
 ' was ',
 ' get ',
 ' her ',
 ' and ',
 ' she ',
 ' her ',
 ' was ',
 ' but ',
 ' had ',
 ' the ',
 ' she ',
 ' her ',
 ' she ',
 ' for ',
 ' day ',
 ' her ',
 ' and ',
 ' the ',
 ' the ',
 ' the ',
 ' was ',
 ' nor ',
 ' out ',
 ' the ',
 ' the ',
 ' say ',
 ' she ',
 ' her ',
 ' she ',
 ' but ',
 ' all ',
 ' but ',
 ' the ',
 ' out ',
 ' its ',
 ' and ',
 ' and ',
 ' her ',
 ' for ',
 ' her ',
 ' out ',
 ' and ',
 ' she ',
 ' and ',
 ' was ',
 ' see ',
 ' pop ',
 ' the ',
 ' the ',
 ' she ',
 ' get ',
 ' for ',
 ' and ',
 ' had ',
 ' she ',
 ' the ',
 ' was ',
 ' she ',
 ' for ',
 ' she ',
 ' her ',
 ' she ',
 ' and ',
 ' she ',
 ' but ',
 ' was ',
 ' see ',
 ' the ',
 ' the ',
 ' and ',
 ' and ',
 ' and ',
 ' she ',
 ' and ',
 ' She ',
 ' jar ',
 ' one ',
 ' the ',
 ' was ',
 ' but ',
 ' her ',
 ' was ',
 ' she ',
 ' not ',
 ' the ',
 ' for ',
 ' put ',
 ' one ',
 ' she ',
 ' How ',
 ' all ',
 ' say ',
 ' off ',
 ' the ',
 ' was ',
 ' the ',
 ' she ',
 ' the ',
 ' the ',


In [11]:
Counter(re.findall(' [A-Za-z][a-z][a-z] ', text)).most_common(10)

[(' the ', 1191),
 (' and ', 611),
 (' she ', 348),
 (' was ', 233),
 (' you ', 206),
 (' her ', 154),
 (' all ', 110),
 (' had ', 105),
 (' for ', 105),
 (' but ', 90)]
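
The space-delimited pattern used in the two cells before this point undercounts. It misses words that touch punctuation or a line break, and since each match consumes its trailing space, the word right after a match cannot match either. A minimal sketch of the alternative, the word-boundary anchor `\b`, on a made-up sample sentence (the sentence is an assumption for illustration, not taken from the corpus):

```python
import re

# Hypothetical sample sentence, chosen to show both failure modes.
sample = "The cat sat. The dog ran, and the cat hid."

# Space-delimited: requires a literal space on both sides of the word.
spaced = re.findall(r' [A-Za-z][a-z][a-z] ', sample)

# Word boundaries: \b matches the empty string at a word edge and consumes nothing.
bounded = re.findall(r'\b[A-Za-z][a-z][a-z]\b', sample)

print(spaced)   # [' cat ', ' The ', ' and ', ' cat ']
print(bounded)  # all ten three-letter words
```

Here `dog` is lost to the first pattern because its leading space was already consumed by `' The '`, and `sat` is lost because it is followed by a period.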

![re4](media/re4.png)([SLP Ch.2](https://web.stanford.edu/~jurafsky/slp3/2.pdf))

![re5](media/re5.png)([SLP Ch.2](https://web.stanford.edu/~jurafsky/slp3/2.pdf))

![re6](media/re6.png)([SLP Ch.2](https://web.stanford.edu/~jurafsky/slp3/2.pdf))

In [12]:
re.findall('...', text[:100])

['CHA',
 'PTE',
 'R I',
 'Dow',
 'n t',
 'he ',
 'Rab',
 'bit',
 '-Ho',
 'Ali',
 'ce ',
 'was',
 ' be',
 'gin',
 'nin',
 'g t',
 'o g',
 'et ',
 'ver',
 'y t',
 'ire',
 'd o',
 'f s',
 'itt',
 'ing',
 ' by',
 ' he',
 'r s',
 'ist',
 'er ']
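
Some characters are missing from the output of the `...` pattern: by default `.` matches any character except a newline, so no triple can span a line break, and leftover characters just before a newline are skipped. A small sketch of the difference, using a made-up string and the standard `re.DOTALL` flag:

```python
import re

# Hypothetical two-line sample.
sample = 'ab\ncde'

# Default: '.' does not match '\n', so 'ab' never completes a triple.
default = re.findall(r'...', sample)

# With re.DOTALL, '.' also matches the newline.
dotall = re.findall(r'...', sample, flags=re.DOTALL)

print(default)  # ['cde']
print(dotall)   # ['ab\n', 'cde']
```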

![re7](media/re7.png)([SLP Ch.2](https://web.stanford.edu/~jurafsky/slp3/2.pdf))

In [13]:
re.findall(r'\w', text[:50])

['C',
 'H',
 'A',
 'P',
 'T',
 'E',
 'R',
 'I',
 'D',
 'o',
 'w',
 'n',
 't',
 'h',
 'e',
 'R',
 'a',
 'b',
 'b',
 'i',
 't',
 'H',
 'o',
 'l',
 'e',
 'A',
 'l',
 'i',
 'c',
 'e',
 'w',
 'a',
 's',
 'b',
 'e',
 'g',
 'i',
 'n']

In [14]:
re.split(r'\s', text[:100])

['',
 'CHAPTER',
 'I.',
 'Down',
 'the',
 'Rabbit-Hole',
 '',
 '',
 'Alice',
 'was',
 'beginning',
 'to',
 'get',
 'very',
 'tired',
 'of',
 'sitting',
 'by',
 'her',
 'sister',
 'on']
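
The empty strings in this output appear because `\s` splits at every single whitespace character, so consecutive or leading whitespace produces empty fields. A short sketch comparing three ways to split, on a made-up sample string:

```python
import re

# Hypothetical sample with a leading newline and a run of three newlines.
sample = '\nCHAPTER I.\nDown\n\n\nAlice'

single = re.split(r'\s', sample)   # split at each whitespace character
runs = re.split(r'\s+', sample)    # split at each run of whitespace
plain = sample.split()             # str.split() also drops leading/trailing empties

print(single)  # ['', 'CHAPTER', 'I.', 'Down', '', '', 'Alice']
print(runs)    # ['', 'CHAPTER', 'I.', 'Down', 'Alice']
print(plain)   # ['CHAPTER', 'I.', 'Down', 'Alice']
```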

![re8](media/re8.png)([SLP Ch.2](https://web.stanford.edu/~jurafsky/slp3/2.pdf))

In [15]:
re.findall(r'\w+', text[:100])

['CHAPTER',
 'I',
 'Down',
 'the',
 'Rabbit',
 'Hole',
 'Alice',
 'was',
 'beginning',
 'to',
 'get',
 'very',
 'tired',
 'of',
 'sitting',
 'by',
 'her',
 'sister',
 'on']

In [16]:
Counter(re.findall(r'\w+', text)).most_common(20)

[('the', 1533),
 ('and', 803),
 ('to', 728),
 ('a', 617),
 ('it', 528),
 ('I', 523),
 ('she', 510),
 ('of', 502),
 ('said', 456),
 ('Alice', 396),
 ('in', 356),
 ('was', 351),
 ('you', 345),
 ('that', 274),
 ('as', 246),
 ('her', 244),
 ('t', 216),
 ('at', 202),
 ('s', 196),
 ('on', 189)]
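
The single-letter "words" `t` and `s` in these counts come from contractions and possessives: `\w` does not match an apostrophe, so `didn’t` is cut into `didn` and `t`. One possible workaround, sketched on a made-up sample, is to allow an apostrophe between two runs of word characters. The sample sentence and the choice of apostrophe characters are assumptions for illustration.

```python
import re

# Hypothetical sample with both curly and straight apostrophes.
sample = "Alice’s cat didn’t move, it's true"

# Plain \w+ breaks words at the apostrophe.
plain = re.findall(r'\w+', sample)

# Allow word-internal apostrophes; the group is non-capturing so that
# findall still returns whole matches.
joined = re.findall(r"\w+(?:[’']\w+)*", sample)

print(plain)   # ['Alice', 's', 'cat', 'didn', 't', 'move', 'it', 's', 'true']
print(joined)  # ['Alice’s', 'cat', 'didn’t', 'move', "it's", 'true']
```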

In [17]:
Counter(re.findall(r'[^\w\s]', text)).most_common(20)

[(',', 2426),
 ('“', 1118),
 ('”', 1114),
 ('.', 987),
 ('’', 702),
 ('!', 451),
 ('—', 263),
 (':', 233),
 ('?', 203),
 (';', 193),
 ('-', 142),
 ('*', 60),
 ('(', 56),
 (')', 56),
 ('‘', 46),
 ('[', 2),
 (']', 2)]

### Substitution and groups

In [19]:
re.sub(r'\s+', ' ', text[:100])

' CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on'

In [21]:
print(re.sub(r'\s+', '\n', text[:100]))


CHAPTER
I.
Down
the
Rabbit-Hole
Alice
was
beginning
to
get
very
tired
of
sitting
by
her
sister
on


In [22]:
re.findall(r'CHAPTER [^\s]+', text)

['CHAPTER I.',
 'CHAPTER II.',
 'CHAPTER III.',
 'CHAPTER IV.',
 'CHAPTER V.',
 'CHAPTER VI.',
 'CHAPTER VII.',
 'CHAPTER VIII.',
 'CHAPTER IX.',
 'CHAPTER X.',
 'CHAPTER XI.',
 'CHAPTER XII.']

In [23]:
print(re.sub(r'CHAPTER ([^\s]+)', r'Chapter \1', text[:100]))


Chapter I.
Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on


In [146]:
# the period after the chapter number is escaped so that it matches only a literal '.'
print(re.sub(r'CHAPTER ([^\s.]+)\.\n([^\n]*)', r'Chapter \1: \2', text[:100]))


Chapter I: Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on


In [144]:
re.findall(r'CHAPTER ([^\s.]+)\.\n([^\n]*)', text)

[('I', 'Down the Rabbit-Hole'),
 ('II', 'The Pool of Tears'),
 ('III', 'A Caucus-Race and a Long Tale'),
 ('IV', 'The Rabbit Sends in a Little Bill'),
 ('V', 'Advice from a Caterpillar'),
 ('VI', 'Pig and Pepper'),
 ('VII', 'A Mad Tea-Party'),
 ('VIII', 'The Queen’s Croquet-Ground'),
 ('IX', 'The Mock Turtle’s Story'),
 ('X', 'The Lobster Quadrille'),
 ('XI', 'Who Stole the Tarts?'),
 ('XII', 'Alice’s Evidence')]
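
Numbered groups become hard to read once a pattern has more than a couple of them. A sketch of the same chapter pattern with named groups, using the standard `(?P<name>...)` syntax and `\g<name>` in the replacement. The group names and the sample string are assumptions made for this illustration.

```python
import re

# Hypothetical excerpt with two chapter headings.
sample = ("CHAPTER I.\nDown the Rabbit-Hole\n\nSome text.\n\n"
          "CHAPTER II.\nThe Pool of Tears\n")

# The dot after the numeral is escaped so it only matches a literal period.
pattern = re.compile(r'CHAPTER (?P<number>[^\s.]+)\.\n(?P<title>[^\n]*)')

# groupdict() returns the named groups of each match as a dictionary.
chapters = [match.groupdict() for match in pattern.finditer(sample)]

# Named groups can be referenced in the replacement string as \g<name>.
retitled = pattern.sub(r'Chapter \g<number>: \g<title>', sample)

print(chapters)
print(retitled)
```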

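Regular expressions are surprisingly powerful. With the right implementation they are also very fast: every regular expression can be compiled into an equivalent [finite state automaton (FSA)](https://en.wikipedia.org/wiki/Finite-state_machine), which reads the input once from left to right, in time linear in its length. (Python's `re` module uses a backtracking matcher and supports extensions such as backreferences, so this guarantee does not hold for every pattern there.) In formal terms, every regular expression defines a [regular language](https://en.wikipedia.org/wiki/Regular_language), the same class of languages that [regular grammars](https://en.wikipedia.org/wiki/Regular_grammar) generate.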
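
The equivalence with finite state automata can be made concrete with a small hand-built example. The sketch below encodes a deterministic automaton for the pattern `[Rr]ab+it` as a transition table and checks it against `re.fullmatch`. The pattern, the state numbering, and the function name `dfa_accepts` are assumptions chosen for illustration; this is not how the `re` module works internally.

```python
import re

# Transition table of a DFA for [Rr]ab+it.
# State 0 is the start state, state 5 is the only accepting state.
TRANSITIONS = {
    (0, 'R'): 1, (0, 'r'): 1,
    (1, 'a'): 2,
    (2, 'b'): 3,
    (3, 'b'): 3,   # loop: any number of additional b's
    (3, 'i'): 4,
    (4, 't'): 5,
}
ACCEPTING = {5}


def dfa_accepts(string):
    """Run the automaton: one table lookup per input character."""
    state = 0
    for char in string:
        state = TRANSITIONS.get((state, char))
        if state is None:       # no transition defined: reject
            return False
    return state in ACCEPTING


for candidate in ['rabbit', 'Rabit', 'rabbbit', 'rait', 'rabbits']:
    expected = re.fullmatch(r'[Rr]ab+it', candidate) is not None
    print(candidate, dfa_accepts(candidate), expected)
```

The loop does a constant amount of work per character, which is where the linear running time comes from.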

![re_xkcd](media/re_xkcd.png)([XKCD #208](https://xkcd.com/208/))

## Text segmentation

### Sentence splitting

#### How to split a text into sentences?

In [24]:
text2 = "'Of course it's only because Tom isn't home,' said Mrs. Parsons vaguely."

Naive: split on `.`, `!`, `?`, etc.

In [25]:
re.split('[.!?]', text2)

["'Of course it's only because Tom isn't home,' said Mrs",
 ' Parsons vaguely',
 '']

Better: use language-specific list of abbreviation words, collocations, etc.

In [26]:
nltk.sent_tokenize(text2)

["'Of course it's only because Tom isn't home,' said Mrs. Parsons vaguely."]
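
To illustrate the idea of an abbreviation list, here is a minimal sketch of a splitter that refuses to break after a known abbreviation. It is not what NLTK's tokenizer does internally; the abbreviation set and the function name `split_sentences` are assumptions made for this example, and closing quotes after the final punctuation are not handled.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (lowercase, without the period).
ABBREVIATIONS = {'mr', 'mrs', 'dr', 'prof', 'etc'}


def split_sentences(text):
    """Split at runs of . ! ? unless a single period follows an abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r'[.!?]+', text):
        # the word immediately before the punctuation, if any
        preceding = re.search(r'\w+$', text[start:match.start()])
        if (match.group() == '.' and preceding
                and preceding.group().lower() in ABBREVIATIONS):
            continue  # abbreviation period: not a sentence boundary
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences


print(split_sentences("He met Mrs. Parsons. She left!"))
# ['He met Mrs. Parsons.', 'She left!']
```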

### Tokenization

#### How to split a text into words?

#### Naive approach: split on whitespace

In [27]:
text2.split()

["'Of",
 'course',
 "it's",
 'only',
 'because',
 'Tom',
 "isn't",
 "home,'",
 'said',
 'Mrs.',
 'Parsons',
 'vaguely.']

#### Better: separate punctuation marks

In [28]:
re.findall(r'(\w+|[^\w\s]+)', text2)[:30]

["'",
 'Of',
 'course',
 'it',
 "'",
 's',
 'only',
 'because',
 'Tom',
 'isn',
 "'",
 't',
 'home',
 ",'",
 'said',
 'Mrs',
 '.',
 'Parsons',
 'vaguely',
 '.']

#### Best: add some language-specific conventions:

In [30]:
nltk.word_tokenize(text2)

["'Of",
 'course',
 'it',
 "'s",
 'only',
 'because',
 'Tom',
 'is',
 "n't",
 'home',
 ',',
 "'",
 'said',
 'Mrs.',
 'Parsons',
 'vaguely',
 '.']
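
Some of these English conventions can be approximated with a regular expression by listing the special cases before the general ones. The sketch below splits off `n't` and clitics such as `'s` in the way seen in the output above. It is a rough approximation written for this example, not NLTK's actual rule set, and it only handles the straight apostrophe.

```python
import re

# Alternatives are tried left to right, so special cases come first:
#   \w+(?=n't)  a word up to, but not including, a following n't
#   n't         the negation clitic itself
#   '\w+        other clitics such as 's, 'll, 're
#   \w+         ordinary words
#   [^\w\s]     any single punctuation character
TOKEN_RE = re.compile(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]")


def tokenize(text):
    return TOKEN_RE.findall(text)


print(tokenize("Tom isn't home, it's late."))
# ['Tom', 'is', "n't", 'home', ',', 'it', "'s", 'late', '.']
```

A known weakness of this sketch: an opening single quote directly before a word is treated like a clitic, so `'Of` stays one token.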

## Text normalization

In [34]:
words = nltk.word_tokenize(text)

In [35]:
words[:10]

['CHAPTER',
 'I',
 '.',
 'Down',
 'the',
 'Rabbit-Hole',
 'Alice',
 'was',
 'beginning',
 'to']

In [36]:
Counter(words).most_common(10)

[(',', 2426),
 ('the', 1520),
 ('“', 1118),
 ('”', 1114),
 ('.', 783),
 ('and', 774),
 ('to', 718),
 ('’', 702),
 ('a', 611),
 ('it', 513)]

Let's get rid of punctuation

In [37]:
words = [word for word in words if re.match(r'\w', word)]

In [38]:
Counter(words).most_common(10)

[('the', 1520),
 ('and', 774),
 ('to', 718),
 ('a', 611),
 ('it', 513),
 ('I', 511),
 ('she', 507),
 ('of', 496),
 ('said', 453),
 ('Alice', 396)]

Filtering out common function words is called __stopword removal__.

In [39]:
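from nltk.corpus import stopwords
nltk.download('stopwords')  # the stopword lists are a separate NLTK resource
# note: this rebinds the name `stopwords` from the corpus reader to a plain set
stopwords = set(stopwords.words('english'))
print(stopwords)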

{'myself', 'her', 'having', 't', 'was', 'again', 'didn', 'same', 'no', 'while', 'few', 'now', 'doing', "it's", 'as', 'i', 's', 'yourself', 'there', 'were', 'are', 'hadn', 'it', 'not', 'all', 'ma', 'him', "that'll", 'itself', 'about', 'both', 'did', 'you', 'from', 'o', "aren't", 'this', 'ourselves', 'of', 'theirs', "won't", 'he', 'until', 'just', 'have', 'that', 'below', 'some', 'll', "hasn't", 'but', 'here', 'mustn', 'we', 'by', 'his', 'a', 'too', "couldn't", 'haven', 'weren', "mustn't", "wouldn't", 'm', 'be', 'above', 'will', "isn't", 'very', "weren't", 'the', 'with', 'don', 'hasn', "wasn't", "should've", "haven't", 'how', 'down', "you've", 'into', 'further', 'aren', 'our', 'for', 'd', "mightn't", 'which', 'am', 'do', 'me', 'only', 'or', 'who', 'to', 'off', 'on', "doesn't", "she's", 'then', 'y', 'why', 'wasn', 'she', "don't", 'ours', 'if', 'own', 'before', 've', 'they', "didn't", 'after', 'these', 'under', 'needn', 'any', 'an', 'more', 'ain', 'whom', 'nor', 'at', 'each', 'what', 'such

In [40]:
words = [word for word in words if word.lower() not in stopwords]

In [41]:
Counter(words).most_common(20)

[('said', 453),
 ('Alice', 396),
 ('little', 125),
 ('one', 93),
 ('went', 83),
 ('like', 83),
 ('thought', 74),
 ('could', 74),
 ('Queen', 74),
 ('know', 72),
 ('would', 70),
 ('time', 64),
 ('see', 64),
 ('King', 61),
 ('began', 57),
 ('Mock', 56),
 ('Turtle', 56),
 ('Hatter', 55),
 ('Gryphon', 55),
 ('quite', 53)]

### Lemmatization and stemming

Words like _say_, _says_, and _said_ are all different **word forms** of the same **lemma**. Grouping them together can be useful in many applications. 

**Stemming** is the reduction of words to a common prefix, using simple rules that only work some of the time:

In [42]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [43]:
for word in ('say', 'says', 'said'):
    print(stemmer.stem(word))

say
say
said


In [44]:
for word in ('he', 'his', 'him'):
    print(stemmer.stem(word))

he
hi
him
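
To see why such rules only work some of the time, consider a toy suffix stripper. This is not the Porter algorithm, only a sketch with a made-up suffix list and function name, but it fails in the same characteristic ways:

```python
# Hypothetical suffix list, tried in order.
SUFFIXES = ('ing', 'ed', 'es', 's')


def toy_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word


for word in ('says', 'said', 'his', 'walking', 'making'):
    print(word, toy_stem(word))
# says say
# said said
# his hi
# walking walk
# making mak
```

Regular forms such as _says_ and _walking_ work, the irregular _said_ is left alone, and _his_ and _making_ are cut to non-words.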


**Lemmatization** is the mapping of word forms to their lemma, using either a dictionary of word forms, a grammar of how words are formed (a **morphology**), or both.
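
A minimal sketch of the "dictionary plus rules" idea, before turning to a real tool: irregular forms are looked up in a small table, and a single plural rule stands in for a morphology. The table contents, the rule, and the function name `lemmatize` are assumptions for illustration only, far simpler than what an actual lemmatizer does.

```python
# Hypothetical dictionary of irregular word forms.
LEMMA_DICT = {
    'said': 'say',
    'was': 'be',
    'is': 'be',
    'had': 'have',
    'thought': 'think',
}


def lemmatize(word_form):
    """Dictionary lookup first, then one toy rule, else return the form unchanged."""
    lowered = word_form.lower()
    if lowered in LEMMA_DICT:
        return LEMMA_DICT[lowered]
    # toy rule: strip a plural or third-person -s (but not -ss)
    if len(lowered) > 3 and lowered.endswith('s') and not lowered.endswith('ss'):
        return lowered[:-1]
    return word_form


for form in ('said', 'Was', 'pictures', 'says', 'Alice'):
    print(form, lemmatize(form))
# said say
# Was be
# pictures picture
# says say
# Alice Alice
```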

In [45]:
nlp = stanza.Pipeline('en')

2021-09-14 11:40:27 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | ewt       |
| pos       | ewt       |
| lemma     | ewt       |
| depparse  | ewt       |
| sentiment | sstplus   |
| ner       | ontonotes |

2021-09-14 11:40:27 INFO: Use device: cpu
2021-09-14 11:40:27 INFO: Loading: tokenize
2021-09-14 11:40:27 INFO: Loading: pos
2021-09-14 11:40:28 INFO: Loading: lemma
2021-09-14 11:40:28 INFO: Loading: depparse
2021-09-14 11:40:29 INFO: Loading: sentiment
2021-09-14 11:40:30 INFO: Loading: ner
2021-09-14 11:40:31 INFO: Done loading processors!


In [46]:
doc = nlp(text[:1000])

In [47]:
for sentence in doc.sentences[:5]:
    for word in sentence.words:
        print(word.text + '\t' + word.lemma)
    print()

CHAPTER	chapter
I.	I.
Down	down
the	the
Rabbit	rabbit
-	-
Hole	Hole

Alice	Alice
was	be
beginning	begin
to	to
get	get
very	very
tired	tired
of	of
sitting	sit
by	by
her	she
sister	sister
on	on
the	the
bank	bank
,	,
and	and
of	of
having	have
nothing	nothing
to	to
do	do
:	:
once	once
or	or
twice	twice
she	she
had	have
peeped	peep
into	into
the	the
book	book
her	she
sister	sister
was	be
reading	read
,	,
but	but
it	it
had	have
no	no
pictures	picture
or	or
conversations	conversation
in	in
it	it
,	,
“	"
and	and
what	what
is	be
the	the
use	use
of	of
a	a
book	book
,	,
”	"
thought	think
Alice	Alice
“	"
without	without
pictures	picture
or	or
conversations	conversation
?	?
”	"

So	so
she	she
was	be
considering	consider
in	in
her	she
own	own
mind	mind
(	(
as	as
well	well
as	as
she	she
could	could
,	,
for	for
the	the
hot	hot
day	day
made	make
her	she
feel	feel
very	very
sleepy	sleepy
and	and
stupid	stupid
)	)
,	,
whether	whether
the	the
pleasure	pleasure
of	of
making	make
a	a
daisy	daisy
-	-
chain	c

The full analysis of how a word form is built from its lemma is known as **morphological analysis**.

In [49]:
for sentence in doc.sentences[:5]:
    for word in sentence.words:
        print('\t'.join([word.text, word.lemma, word.upos, word.feats if word.feats else '']))
    print()

CHAPTER	chapter	NOUN	Number=Sing
I.	I.	PROPN	Number=Sing
Down	down	ADP	
the	the	DET	Definite=Def|PronType=Art
Rabbit	rabbit	NOUN	Number=Sing
-	-	PUNCT	
Hole	Hole	PROPN	Number=Sing

Alice	Alice	PROPN	Number=Sing
was	be	AUX	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
beginning	begin	VERB	Tense=Pres|VerbForm=Part
to	to	PART	
get	get	VERB	VerbForm=Inf
very	very	ADV	
tired	tired	ADJ	Degree=Pos
of	of	SCONJ	
sitting	sit	VERB	VerbForm=Ger
by	by	ADP	
her	she	PRON	Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs
sister	sister	NOUN	Number=Sing
on	on	ADP	
the	the	DET	Definite=Def|PronType=Art
bank	bank	NOUN	Number=Sing
,	,	PUNCT	
and	and	CCONJ	
of	of	SCONJ	
having	have	VERB	VerbForm=Ger
nothing	nothing	PRON	Number=Sing
to	to	PART	
do	do	VERB	VerbForm=Inf
:	:	PUNCT	
once	once	ADV	NumType=Mult
or	or	CCONJ	
twice	twice	SCONJ	
she	she	PRON	Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs
had	have	AUX	Mood=Ind|Tense=Past|VerbForm=Fin
peeped	peep	VERB	Tense=Past|VerbForm=Part
into	in

A special case of lemmatization is **decompounding**: recognizing multiple lemmas in a single word.

In [50]:
nlp('wastebasket')

[
  [
    {
      "id": 1,
      "text": "wastebasket",
      "lemma": "wastebasket",
      "upos": "NOUN",
      "xpos": "NN",
      "feats": "Number=Sing",
      "head": 0,
      "deprel": "root",
      "misc": "start_char=0|end_char=11",
      "ner": "O"
    }
  ]
]

For English you might say that this is good enough... but _some languages_ allow forming compounds on the fly...

In [51]:
nlp_de = stanza.Pipeline('de')

2021-09-14 11:42:50 INFO: Loading these models for language: de (German):
| Processor | Package |
-----------------------
| tokenize  | gsd     |
| mwt       | gsd     |
| pos       | gsd     |
| lemma     | gsd     |
| depparse  | gsd     |
| sentiment | sb10k   |
| ner       | conll03 |

2021-09-14 11:42:50 INFO: Use device: cpu
2021-09-14 11:42:50 INFO: Loading: tokenize
2021-09-14 11:42:50 INFO: Loading: mwt
2021-09-14 11:42:50 INFO: Loading: pos
2021-09-14 11:42:51 INFO: Loading: lemma
2021-09-14 11:42:51 INFO: Loading: depparse
2021-09-14 11:42:52 INFO: Loading: sentiment
2021-09-14 11:42:53 INFO: Loading: ner
2021-09-14 11:42:54 INFO: Done loading processors!


In [52]:
nlp_de('Kassenidentifikationsnummer')

[
  [
    {
      "id": 1,
      "text": "Kassenidentifikationsnummer",
      "lemma": "Kassenidentifikationsnummer",
      "upos": "NOUN",
      "xpos": "NN",
      "feats": "Case=Nom|Gender=Neut|Number=Sing",
      "head": 0,
      "deprel": "root",
      "misc": "start_char=0|end_char=27",
      "ner": "O"
    }
  ]
]

There is no good solution and no standard tool. There are some unsupervised approaches like [SECOS](https://github.com/riedlma/SECOS) and [CharSplit](https://github.com/dtuggener/CharSplit), and there are also full-fledged morphological analyzers that might work, like [SMOR](https://www.cis.lmu.de/~schmid/tools/SMOR/) and its extensions [zmorge](https://pub.cl.uzh.ch/users/sennrich/zmorge/) and [SMORLemma](https://github.com/rsennrich/SMORLemma).
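
To make the task concrete, here is a toy dictionary-based splitter. It is not how SECOS, CharSplit, or SMOR work; it simply tries to cover the word with known lemmas, allowing a few German linking elements between the parts. The vocabulary, the list of linking elements, and the function name `split_compound` are all assumptions for this sketch.

```python
# Hypothetical mini-vocabulary of known lemmas (lowercase).
VOCAB = {'kasse', 'identifikation', 'nummer'}

# Linking elements that may appear between parts (assumed, not exhaustive).
LINKERS = ('', 's', 'n', 'en', 'es')


def split_compound(word, vocab=VOCAB):
    """Return a list of known parts covering the word, or None if there is none."""
    word = word.lower()
    if word in vocab:
        return [word]
    # try the longest known first part, then recurse on the remainder
    for i in range(len(word) - 1, 0, -1):
        head, rest = word[:i], word[i:]
        if head not in vocab:
            continue
        for linker in LINKERS:
            if rest.startswith(linker):
                tail = split_compound(rest[len(linker):], vocab)
                if tail:
                    return [head] + tail
    return None


print(split_compound('Kassenidentifikationsnummer'))
# ['kasse', 'identifikation', 'nummer']
```

The approach depends entirely on the vocabulary and gives no way to choose between competing analyses, which is one reason the real tools listed above are more involved.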

## Examples

### Text processing with regular expressions

Load a sample text

In [53]:
print(text[:1000])


CHAPTER I.
Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into
the book her sister was reading, but it had no pictures or
conversations in it, “and what is the use of a book,” thought Alice
“without pictures or conversations?”

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure of
making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.

There was nothing so _very_ remarkable in that; nor did Alice think it
so _very_ much out of the way to hear the Rabbit say to itself, “Oh
dear! Oh dear! I shall be late!” (when she thought it over afterwards,
it occurred to her that she ought to have wondered at this, but at the
time it all seemed quite natural); but when the Rabbit actually _took a
watch out of its 

In [54]:
def clean_text(text):
    cleaned_text = re.sub('_','',text)
    cleaned_text = re.sub('\n', ' ', cleaned_text)
    return cleaned_text

In [55]:
text = clean_text(text)

In [56]:
print(text[:1000])

 CHAPTER I. Down the Rabbit-Hole   Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?”  So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.  There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, “Oh dear! Oh dear! I shall be late!” (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waist

Let's split this into sentences, then words.

In [57]:
sens = sent_tokenize(text)

In [58]:
print('\n\n'.join(sens[:5]))

 CHAPTER I.

Down the Rabbit-Hole   Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, “and what is the use of a book,” thought Alice “without pictures or conversations?”  So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.

There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, “Oh dear!

Oh dear!

I shall be late!” (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its wa

In [59]:
toks = [word_tokenize(sen) for sen in sens]

In [60]:
print('\n\n'.join('\n'.join(sen) for sen in toks[:5]))

CHAPTER
I
.

Down
the
Rabbit-Hole
Alice
was
beginning
to
get
very
tired
of
sitting
by
her
sister
on
the
bank
,
and
of
having
nothing
to
do
:
once
or
twice
she
had
peeped
into
the
book
her
sister
was
reading
,
but
it
had
no
pictures
or
conversations
in
it
,
“
and
what
is
the
use
of
a
book
,
”
thought
Alice
“
without
pictures
or
conversations
?
”
So
she
was
considering
in
her
own
mind
(
as
well
as
she
could
,
for
the
hot
day
made
her
feel
very
sleepy
and
stupid
)
,
whether
the
pleasure
of
making
a
daisy-chain
would
be
worth
the
trouble
of
getting
up
and
picking
the
daisies
,
when
suddenly
a
White
Rabbit
with
pink
eyes
ran
close
by
her
.

There
was
nothing
so
very
remarkable
in
that
;
nor
did
Alice
think
it
so
very
much
out
of
the
way
to
hear
the
Rabbit
say
to
itself
,
“
Oh
dear
!

Oh
dear
!

I
shall
be
late
!
”
(
when
she
thought
it
over
afterwards
,
it
occurred
to
her
that
she
ought
to
have
wondered
at
this
,
but
at
the
time
it
all
seemed
quite
natural
)
;
but
when
the
Rabbit
actually
t

Let's also write this to a file

In [71]:
with open('data/alice_tok.txt', 'w') as f:
    f.write('\n\n'.join('\n'.join(sen) for sen in toks) + '\n')
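
The format written here (one token per line, an empty line between sentences) is easy to read back. A sketch of the round trip on a small made-up sample, using a temporary directory so that no real file is touched; the helper names `write_tokens` and `read_tokens` are assumptions for this example.

```python
import os
import tempfile


def write_tokens(path, sentences):
    """One token per line, sentences separated by an empty line."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n\n'.join('\n'.join(sen) for sen in sentences) + '\n')


def read_tokens(path):
    """Inverse of write_tokens: returns a list of token lists."""
    with open(path, encoding='utf-8') as f:
        blocks = f.read().strip('\n').split('\n\n')
    return [block.split('\n') for block in blocks if block]


# Hypothetical sample: two tokenized sentences.
sample = [['Oh', 'dear', '!'], ['I', 'shall', 'be', 'late', '!']]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_tok.txt')
    write_tokens(path, sample)
    restored = read_tokens(path)

print(restored == sample)  # True
```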

Let's try to find all names using regexes

In [62]:
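def find_names(toks):
    """Yield maximal runs of capitalized tokens as candidate names."""
    for sen in toks:
        curr_name = []
        # skip the first token: sentence-initial words are always capitalized
        for tok in sen[1:]:
            if re.match('[A-Z][a-z]+', tok):
                curr_name.append(tok)
            elif curr_name:
                yield ' '.join(curr_name)
                curr_name = []

        # flush a name that runs to the end of the sentence
        if curr_name:
            yield ' '.join(curr_name)


def count_names(toks):
    name_counter = Counter()

    for name in find_names(toks):
        name_counter[name] += 1

    # print candidates from most to least frequent
    for name, count in name_counter.most_common():
        print(name, count)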

In [63]:
count_names(toks)

Alice 341
Queen 67
King 61
Gryphon 54
Hatter 53
Mock Turtle 53
And 51
You 48
It 46
What 40
Duchess 40
Dormouse 38
March Hare 30
But 29
Mouse 26
That 25
Caterpillar 25
Oh 24
The 24
Well 23
White Rabbit 22
Cat 22
She 21
Rabbit 20
How 20
Why 20
Come 16
If 16
There 16
They 14
He 14
Then 14
So 13
Dinah 13
Dodo 13
Bill 13
No 12
As 12
Yes 12
We 11
Of 11
Pigeon 11
Majesty 11
Do 9
Now 9
For 9
Who 9
Not 9
This 9
Off 8
Lory 7
In 7
Footman 7
Five 7
The Queen 7
Knave 7
Which 6
Just 6
When 6
Lizard 6
Hearts 6
Soup 6
English 5
Would 5
Are 5
Let 5
Here 5
Two 5
Seven 5
Soo—oop 5
Mabel 4
With 4
The Mouse 4
French 4
One 4
Said 4
Please 4
Ah 4
Hold 4
Sure 4
Father William 4
At 4
Don 4
Cheshire Cat 4
Call 4
Very 4
Nothing 4
Perhaps 3
William 3
Duck 3
Eaglet 3
Did 3
Everybody 3
Only 3
Mary Ann 3
Is 3
Yet 3
Pray 3
Serpent 3
While 3
All 3
Have 3
Exactly 3
Time 3
Take 3
Consider 3
Tis 3
Never 3
Thank 3
Lobster Quadrille 3
Owl 3
Beautiful 3
Give 3
Latitude 2
Longitude 2
Ma 2
Paris 2
Morcar 2
Mercia 2
Found 2
Fu

We can filter our tokens for stopwords:

In [65]:
toks_without_stopwords = [[tok for tok in sen if tok.lower() not in stopwords] for sen in toks]

In [66]:
print('\n\n'.join('\n'.join(sen) for sen in toks_without_stopwords[:5]))

CHAPTER
.

Rabbit-Hole
Alice
beginning
get
tired
sitting
sister
bank
,
nothing
:
twice
peeped
book
sister
reading
,
pictures
conversations
,
“
use
book
,
”
thought
Alice
“
without
pictures
conversations
?
”
considering
mind
(
well
could
,
hot
day
made
feel
sleepy
stupid
)
,
whether
pleasure
making
daisy-chain
would
worth
trouble
getting
picking
daisies
,
suddenly
White
Rabbit
pink
eyes
ran
close
.

nothing
remarkable
;
Alice
think
much
way
hear
Rabbit
say
,
“
Oh
dear
!

Oh
dear
!

shall
late
!
”
(
thought
afterwards
,
occurred
ought
wondered
,
time
seemed
quite
natural
)
;
Rabbit
actually
took
watch
waistcoat-pocket
,
looked
,
hurried
,
Alice
started
feet
,
flashed
across
mind
never
seen
rabbit
either
waistcoat-pocket
,
watch
take
,
burning
curiosity
,
ran
across
field
,
fortunately
time
see
pop
large
rabbit-hole
hedge
.


In [67]:
count_names(toks_without_stopwords)

Alice 342
Queen 67
King 55
Mock Turtle 51
Gryphon 51
Hatter 49
Duchess 39
Dormouse 35
March Hare 29
Mouse 28
Rabbit 24
Oh 24
Caterpillar 24
Well 23
Cat 19
White Rabbit 17
Come 16
Dinah 13
Dodo 12
Yes 12
Bill 12
Pigeon 11
Majesty 11
Footman 8
Lory 6
Lizard 6
Five 6
Soup 6
English 5
Would 5
Let 5
Two 5
Knave 5
Soo—oop 5
Mabel 4
French 4
One 4
Said 4
Ah 4
Hold 4
Sure 4
Father William 4
Cheshire Cat 4
Call 4
Seven 4
Nothing 4
Perhaps 3
Everybody 3
Please 3
Mary Ann 3
Yet 3
Pray 3
Serpent 3
Exactly 3
Time 3
Queen Hearts 3
Take 3
Knave Hearts 3
Consider 3
Tis 3
Never 3
Thank 3
Lobster Quadrille 3
Beautiful 3
Give 3
Paris 2
Found 2
Duck 2
Eaglet 2
Fury 2
Crab 2
Nobody 2
Pat 2
Explain 2
Keep 2
Pepper 2
Cheshire 2
Wow 2
May 2
Suppose 2
Twinkle 2
Wake 2
Tell 2
Treacle 2
Really 2
Miss 2
Turn 2
Get 2
Turtle 2
Tortoise 2
Uglification 2
Go 2
Panther 2
Owl 2
Beautiful Soup 2
Silence 2
First 2
Unimportant 2
Adventures 2
Wonderland 2
Latitude Longitude 1
Latitude 1
Longitude 1
New Zealand Australia 1
D

Let's also write the stopwords into a file

In [70]:
with open('data/stopwords.txt', 'w') as f:
    f.write('\n'.join(sorted(stopwords)) + '\n')

Continue to [Text processing on the Linux command line](01b_Text_processing_Linux_command_line)