<h2 style="text-align: center;" id="part-1">Getting Started with NLTK</h2>

In [None]:
! pip install nltk -i https://pypi.tuna.tsinghua.edu.cn/simple
#!conda install anaconda::nltk --yes

In [None]:
# install nltk (see the cell above) before importing it
import nltk
# download all NLTK datasets into a local directory and add it to the search path
download_dir = '/home/jovyan/work/nltk_data/'
nltk.data.path.append(download_dir)
nltk.download('all', download_dir=download_dir)

In [None]:
# from NLTK's book module, load all items.
from nltk.book import *

In [None]:
# enter their names at the Python prompt
print(text1)
print(text2)

In [None]:
print(sent1)

In [None]:
print(sent2)

<h2 style="text-align: center;">Frequency Distributions</h2>

A frequency distribution tells us how often each vocabulary item occurs in a text. It is a "distribution" because it tells us how the total number of word tokens in the text is distributed across the vocabulary items. Since we often need frequency distributions in language processing, NLTK provides built-in support for them.
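
To make the idea concrete without downloading a corpus, here is a minimal sketch that uses `collections.Counter` from the standard library as a stand-in for NLTK's `FreqDist` (which is itself a `Counter` subclass). The toy token list is made up for illustration and is not taken from any NLTK text.

```python
from collections import Counter

# Toy token list (an assumption for illustration only).
tokens = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]

counts = Counter(tokens)          # maps each vocabulary item to its count
total = sum(counts.values())      # N: total number of word tokens

top_two = counts.most_common(2)   # the two most frequent items
rel_freq_the = counts["the"] / total  # relative frequency, analogous to FreqDist.freq

print(top_two)
print(counts["whale"])
print(rel_freq_the)
```

The counts always sum to the number of tokens, which is what makes this a distribution of the tokens over the vocabulary.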

In [None]:
# Task: create a frequency distribution for text1 (Moby-Dick text)

# A frequency distribution for the outcomes of an experiment. A frequency distribution 
# records the number of times each outcome of an experiment has occurred. For example, 
# a frequency distribution could be used to record the frequency of each word type in 
# a document. 

fdist1 = FreqDist(text1)
print(fdist1)

In [None]:
# Task: find the 50 most frequent words of text1
print(fdist1.most_common(50))

In [None]:
# Task: find the frequency of word 'whale' in text1
print(fdist1['whale'])

In [None]:
!conda install conda-forge::matplotlib --yes

In [None]:
# Task: plot the frequencies of the 50 most common tokens
fdist1.plot(50, cumulative=False)
# What do you find ?

<h2 style="text-align: center;">Fine-grained Selection of Words</h2>

In [None]:
# Task: look at the long words of the book Moby-Dick (text1)
#       find all words that have more than 15 chars.
V = set(text1)
long_words = [w for w in V if len(w) > 15]
print(sorted(long_words))

In [None]:
# Task: look at the long words of the Inaugural Address Corpus (text4)
#       find all words that have more than 15 chars.
V = set(text4)
long_words = [w for w in V if len(w) > 15]
print(sorted(long_words))

# What do you find ?

In [None]:
# Task: find all long words of text5
print(sorted([w for w in set(text5) if len(w) > 15]))

# What do you find ?

In [None]:
# Task: find frequently occurring long words. 
fdist5 = FreqDist(text5)
print(sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7))

# What do you find ?

In [None]:
# Task: find hapaxes (hapax legomena: words that occur only once in the text)
fdist1 = FreqDist(text1)
hapax = fdist1.hapaxes()
print(hapax[:10])

# What do you find ?

<h2 style="text-align: center;">Collocations and Bigrams</h2>

In [None]:
# Task: generate bigrams from a word list
# A collocation is a sequence of words that occur together unusually often. 
# Thus red wine is a collocation,  whereas the wine is not.
list(bigrams(['more', 'is', 'said', 'than', 'done']))
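
As a rough illustration of what bigram extraction does, the sketch below pairs each token with its successor using only the standard library. The helper `toy_bigrams` and the token list are hypothetical; they mirror the behavior of `bigrams` on a list but are not NLTK's implementation.

```python
from collections import Counter

def toy_bigrams(tokens):
    """Pair each token with the token that follows it (hypothetical helper)."""
    return list(zip(tokens, tokens[1:]))

# Same word list as in the cell above.
pairs = toy_bigrams(["more", "is", "said", "than", "done"])
print(pairs)

# Counting bigrams in a made-up token sequence.
tokens = "red wine and the wine and red wine".split()
pair_counts = Counter(toy_bigrams(tokens))
print(pair_counts[("red", "wine")], pair_counts[("the", "wine")])
```

Raw counts alone are not enough to identify collocations: a pair of very common words can be frequent simply because each word is frequent. That is why collocation finders typically score pairs against the frequencies of the individual words.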

In [None]:
# Task: find frequent bigrams of text4 (Inaugural Address Corpus)
text4.collocations(20)

In [None]:
# Task: find frequent bigrams of text8 (Personals Corpus 
#       comes from personal ads posted on various online 
#       dating sites)
text8.collocations() 

<h2 style="text-align: center;">Counting</h2>

In [None]:
# Task: count how often each word length occurs

# For example, we can look at the distribution of word lengths 
# in a text, by creating a FreqDist out of a long list of numbers, 
# where each number is the length of the corresponding word in the text:

fdist = FreqDist(len(w) for w in text1)
print(fdist)
fdist

In [None]:
# Task: show the frequencies of all word lengths, most frequent first
fdist.most_common()

In [None]:
# Task: find the most frequent word length, then the count and relative frequency of length 3
print(fdist.max())
print(fdist[3])
print(fdist.freq(3))

In [None]:
# Task: what is this ?
print(sorted(w for w in set(text1) if w.endswith('ableness'))[:10])

In [None]:
# Task: what is this ?
print(sorted(term for term in set(text4) if 'gnt' in term))

In [None]:
# Task: what is this ?
print(sorted(item for item in set(text6) if item.istitle())[:10])

In [None]:
# Task: what is this ?
print(sorted(item for item in set(sent7) if item.isdigit()))

In [None]:
print(sorted(w for w in set(text7) if '-' in w and 'index' in w))
print(sorted(wd for wd in set(text3) if wd.istitle() and len(wd) > 10))
print(sorted(w for w in set(sent7) if not w.islower()))
print(sorted(t for t in set(text2) if 'cie' in t or 'cei' in t))

In [None]:
upper_words = [w.upper() for w in text1]
print(upper_words[:10])

In [None]:
print(text1)
print(len(text1))
print(len(set(text1)))
print(len(set(word.lower() for word in text1)))
# lowercasing merges variants such as 'The' and 'the'.

In [None]:
# eliminate numbers and punctuation from the vocabulary count by 
# filtering out any non-alphabetic items:
print(len(set(word.lower() for word in text1 if word.isalpha())))

In [None]:
# Check the word type for sent1
for token in sent1:
    if token.islower():
        print(f'{token:10} is a lowercase word')
    elif token.istitle():
        print(f'{token:10} is a titlecase word')
    else:
        print(f'{token:10} is punctuation')

In [None]:
# create a list of cie and cei words, 
# then we loop over each item and print it. 
tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
for word in tricky:
    print(word, end=' ')

<h2 style="text-align: center;" id="part-2">Searching Text</h2>

In [None]:
import nltk
from nltk.book import *
import matplotlib
import matplotlib.pyplot as plt

In [None]:
# Task: look up the context of word "monstrous" in Moby Dick (text1) 
print('-'*17)
text1.concordance("monstrous")
print('-'*17)
text1.concordance("fudan")

In [None]:
# Task: search Sense and Sensibility (text2) for the word "affection"
print('-'*17)
text2.concordance("affection")

In [None]:
# Task: search the book of Genesis (text3) to find out how long some people lived
text3.concordance("lived")

In [None]:
# Task: look at text4, the Inaugural Address Corpus, to see examples of English going back to 1789, 
print('-'*17)
text4.concordance("nation")
print('-'*17)
text4.concordance("terror")
# see how these words have been used differently over time. 

In [None]:
# Task: find lol context in text5, the NPS Chat Corpus: 
#       search this for unconventional words like im, ur, lol.
text5.concordance("lol")

In [None]:
# Task: find similar contexts.
#       We saw that monstrous occurred in contexts such as
#       the ___ pictures and a ___ size. What other words
#       appear in a similar range of contexts?
text1.concordance("monstrous")
print('-'*17)
text1.similar("monstrous")

In [None]:
# Task: find similar context of monstrous in text2
text2.concordance("monstrous")
print('-'*17)
text2.similar("monstrous")

# Observe that we get different results for different texts. 
# Austen uses this word quite differently from Melville; 
# for her, monstrous has positive connotations, and sometimes 
# functions as an intensifier like the word very.

In [None]:
# Task: find the common contexts of two words
# The method common_contexts lets us examine
# just the contexts that are shared by two or more words
text1.common_contexts(["monstrous", "very"])
text2.common_contexts(["monstrous", "very"])

In [None]:
# Task: show the dispersion plot for Elinor, Edward, Marianne, Willoughby
text2.dispersion_plot(["Elinor", "Edward", "Marianne", "Willoughby"])
# TODO: There is a bug in the code.

In [None]:
# Task: generate some random text in the various styles we have just seen.
text3.generate()

In [None]:
!conda install conda-forge::wordcloud --yes

In [None]:
# Task: generate a word cloud
#       you may need to install wordcloud first
import sys
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

In [None]:
words = [w for w in text1]
fd = nltk.FreqDist(words).most_common()
wc = WordCloud(background_color='white', max_words=2000, stopwords=STOPWORDS, max_font_size=50,
              random_state=17)
wc.generate(' '.join(words))
plt.rcParams["figure.figsize"] = (6, 4)
plt.imshow(wc)
plt.axis('off')
plt.show()

In [None]:
data_analysis = nltk.FreqDist(text1)
filter_words = dict([(m,n) for m,n in data_analysis.items() if len(m) > 3])
data_analysis = nltk.FreqDist(filter_words)
data_analysis.plot(25,cumulative=False)

<h2 style="text-align: center;">Web and Chat Text</h2>

In [None]:
from nltk.corpus import webtext
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65], '...')

<h2 style="text-align: center;">spaCy </h2>

In [None]:
!conda install conda-forge::spacy --yes

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

<h2 style="text-align: center;">Regular Expression</h2>

In [None]:
!conda install anaconda::seaborn --yes

In [None]:
# Python has a built-in regular expression module, re, which we import here
import re
import nltk
import seaborn as sn
import matplotlib.pyplot as plt

In [None]:
# Task: Find woodchuck or Woodchuck : Disjunction
test_str = "This string contains Woodchuck and woodchuck."
result=re.search(pattern="[wW]oodchuck", string=test_str)
print(result)
result=re.search(pattern=r"[wW]ooodchuck", string=test_str)
print(result)

In [None]:
# Find the word "woodchuck" in the following test string
test_str = "interesting links to woodchucks ! and lemurs!"
re.search(pattern="woodchuck", string=test_str)

# Find !, it follows the same way:
print(re.search(pattern="!", string=test_str))
print(re.search(pattern="!!", string=test_str))
assert re.search(pattern="!!", string=test_str) is None # matches nothing


In [None]:
# Find any single digit in a string.
result=re.search(pattern=r"[0123456789]", string="plenty of 7 to 5")
print(result)
result=re.search(pattern=r"[0-9]", string="plenty of 7 to 5")
print(result)

In [None]:
# Negation: If the caret ^ is the first symbol after [,
# the resulting pattern is negated. For example, the pattern 
# [^a] matches any single character (including special characters) except a.

# -- not an upper case letter
print(re.search(pattern=r"[^A-Z]", string="Oyfn pripetchik"))

# -- neither 'S' nor 's'
print(re.search(pattern=r"[^Ss]", string="I have no exquisite reason for't"))

# -- not a period
print(re.search(pattern=r"[^.]", string="our resident Djinn"))

# -- either 'e' or '^'
print(re.search(pattern=r"[e^]", string="look up ^ now"))

# -- the pattern ‘a^b’
print(re.search(pattern=r'a\^b', string=r'look up a^b now'))

In [None]:
# More disjunctions
str1 = "Woodchucks is another name for groundhog!"
result = re.search(pattern="groundhog|woodchuck",string=str1)
print(result)

str1 = "Find all woodchuckk Woodchuck Groundhog groundhogxxx!"
result = re.findall(pattern="[gG]roundhog|[Ww]oodchuck",string=str1)
print(result)

In [None]:
# Some special chars

# ?: Optional previous char
str1 = "Find all color colour colouur colouuur colouyr"
result = re.findall(pattern="colou?r",string=str1)
print(result)

# *: 0 or more of previous char
str1 = "Find all color colour colouur colouuur colouyr"
result = re.findall(pattern="colou*r",string=str1)
print(result)

# +: 1 or more of previous char
str1 = "baa baaa baaaa baaaaa"
result = re.findall(pattern="baa+",string=str1)
print(result)
# .: any char
str1 = "begin begun begun beg3n"
result = re.findall(pattern="beg.n",string=str1)
print(result)
str1 = "The end."
result = re.findall(pattern=r"\.$",string=str1)
print(result)
str1 = "The end? The end. #t"
result = re.findall(pattern=".$",string=str1)
print(result)

In [None]:
# find all "the" in a raw text.
text = "If two sequences in an alignment share a common ancestor, \
mismatches can be interpreted as point mutations and gaps as indels (that \
is, insertion or deletion mutations) introduced in one or both lineages in \
the time since they diverged from one another. In sequence alignments of \
proteins, the degree of similarity between amino acids occupying a \
particular position in the sequence can be interpreted as a rough \
measure of how conserved a particular region or sequence motif is \
among lineages. The absence of substitutions, or the presence of \
only very conservative substitutions (that is, the substitution of \
amino acids whose side chains have similar biochemical properties) in \
a particular region of the sequence, suggest [3] that this region has \
structural or functional importance. Although DNA and RNA nucleotide bases \
are more similar to each other than are amino acids, the conservation of \
base pairs can indicate a similar functional or structural role."
matches = re.findall("[^a-zA-Z][tT]he[^a-zA-Z]", text)
print(matches)

In [None]:
# A nicer way is to do the following

matches = re.findall(r"\b[tT]he\b", text)
print(matches)
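
To see why the word-boundary version is nicer, the following sketch runs both patterns on a short made-up sentence (the sentence is an assumption for illustration). The first pattern needs a non-letter character on both sides, so it misses a match at the very start of the string and returns the surrounding characters as part of each match.

```python
import re

sample = "The cat saw the other cat; then the end."

# Needs a non-letter on both sides: misses "The" at position 0 and
# includes the neighbouring spaces in each match.
loose = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", sample)

# \b matches the empty position between a word character and a non-word
# character (or the string edge), so nothing extra is consumed.
bounded = re.findall(r"\b[tT]he\b", sample)

print(loose)
print(bounded)
```

Neither pattern is fooled by "other" or "then", because in both cases "the" is adjacent to another letter.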

<h2 style="text-align: center;">Words and Corpus</h2>

In [None]:
# try to download some corpus
import nltk
nltk.download('brown')

In [None]:
from nltk.corpus import brown
from nltk.corpus import gutenberg
from nltk.corpus import indian
from nltk.corpus import conll2007

### Word types and word instances (tokens)

- **Word types** are the number of distinct words in a corpus; if the set of words in the vocabulary is $V$, the number of types is the vocabulary size $|V|$. 

- **Word instances** are the total number $N$ of running words.
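
A minimal sketch of the two definitions on a made-up sentence (the sentence is an assumption for illustration): $N$ is the length of the token list, and $|V|$ is the size of the set built from it. Whether "The" and "the" count as one type depends on case folding.

```python
tokens = "the cat sat on the mat and The dog sat too".split()

N = len(tokens)                          # word instances (tokens)
V = set(tokens)                          # case-sensitive word types
V_lower = {t.lower() for t in tokens}    # case-insensitive word types

print(f"N = {N}, |V| = {len(V)}, |V| after lowercasing = {len(V_lower)}")
```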

In [None]:
print(brown.words())
print(f"total number of tokens in Brown corpus: {len(brown.words())}")
for cat in brown.categories():
    print(f"category {cat} has {len(brown.words(categories=cat))} tokens")
print(f"It has {len(nltk.FreqDist(w.lower() for w in brown.words()))} case-insensitive types")
print(f"It has {len(nltk.FreqDist(w for w in brown.words()))} case-sensitive types")

In [None]:
news_text = brown.words(categories='news')
fdist = len(nltk.FreqDist(w.lower() for w in news_text))
fdist_case_sensitive = len(nltk.FreqDist(w for w in news_text))
print(f"there are {fdist} different words in news category!")
print(f"there are {fdist_case_sensitive} case sensitive words in news category!")

In [None]:
fdist = len(nltk.FreqDist(w.lower() for w in brown.words()))
fdist_case_sensitive = len(nltk.FreqDist(w for w in brown.words()))
print(f"there are {fdist} different words across all categories!")
print(f"there are {fdist_case_sensitive} case sensitive words across all categories!")

In [None]:
from nltk.corpus import brown
print(f"all categories of brown: {brown.categories()}")
print(f"all words in news: {brown.words(categories='news')}")
print(brown.words(fileids=['cg22']))
print(brown.sents(categories=['news', 'editorial', 'reviews']))

In [None]:
print(f"brown corpus has {len(brown.fileids())} files in total, which belong to {len(brown.categories())} categories")
print(f"first 10 file names: {brown.fileids()[:10]}")

In [None]:
from nltk.corpus import gutenberg
print(gutenberg.fileids())

In [None]:
emma_words = gutenberg.words('austen-emma.txt')
type(emma_words)
print(gutenberg.words('austen-emma.txt'))
# How many tokens in the text:
print("Token count:", len(emma_words))

# What is the token at index 1000?
print("token at index 1000:", emma_words[1000])

# Slice from token 1400 to 1500
print("slice from 1400 to 1500:", emma_words[1400:1500])

<h2 style="text-align: center;">Word Tokenization</h2>

There are two types of tokenization:

- **Top-down tokenization**: We define a standard and write rules that implement it.
  - word tokenization
  - character tokenization
- **Bottom-up tokenization**: We use simple statistics of letter sequences to break up words into subword tokens.
  - subword tokenization (modern LLMs use this type!)

### Top-down (rule-based) tokenization - word tokenization

In [None]:
# Use split method via the whitespace " "
text = """While the Unix command sequence just removed all the numbers and punctuation"""
print(text.split(" "))

In [None]:
# But, we have punctuations, icons, and many other small issues.
text = """Don't you love 🤗 Transformers? We sure do."""
print(text.split(" "))

In [None]:
# Top-down tokenization by using regular expression
pattern = r'''(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A. 
| \w+(?:-\w+)* # words with optional internal hyphens 
| \$?\d+(?:\.\d+)?%? # currency, percentages, e.g. $12.40, 82% 
| \.\.\. # ellipsis 
| [][.,;"'?():_`-] # these are separate tokens; includes ], [
'''
print(f'pattern needs to match is: \n\n{pattern}')

In [None]:
text = """Don't you love 🤗 Transformers? We sure do."""
print(f"tokenized words after pattern matching: \n\n{nltk.regexp_tokenize(text, pattern)}")

In [None]:
# spacy works much better
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
for token in doc: 
    print(token)

In [None]:
text = """While the Unix command sequence just removed all the numbers and punctuation,
for most NLP applications we’ll need to keep these in our tokenization. But we’ll often want
to keep the punctuation that occurs word internally, in examples like m.p.h., Ph.D.,
AT&T, and cap’n. Special characters and numbers will need to be kept in prices
($45.55) and dates (01/02/06); we don’t want to segment that price into separate
tokens of “45” and “55”. And there are URLs (https://www.stanford.edu),
Twitter hashtags (#nlproc), or email addresses (someone@cs.colorado.edu).
Number expressions introduce other complications as well; while commas normally
appear at word boundaries, commas are used inside numbers in English, every
three digits: 555,500.50. (or sometimes periods)
where English puts commas, for example, 555 500,50."""
text = text.replace("\n", " ").strip()
print(f"tokenized words after pattern matching: \n\n{nltk.regexp_tokenize(text, pattern)}")

In [None]:
# spacy works much better
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
for token in doc[:10]: 
    print(token)

In [None]:
# Tokenization is more complex in languages such as written Chinese and Japanese,
# which do not put spaces between words.
from nltk.tokenize.treebank import TreebankWordTokenizer
text = '姚明进入总决赛'
t = TreebankWordTokenizer()
toks = t.tokenize(text)
print(toks)

In [None]:
!conda install conda-forge::jieba --yes

In [None]:
# NLTK's StanfordSegmenter can segment Chinese, but it needs a separately
# installed jar file, so the import is left commented out here:
# from nltk.tokenize.stanford_segmenter import StanfordSegmenter
# An alternative way to tokenize Chinese words is jieba,
# installed via conda as: conda install conda-forge::jieba
# Website: https://github.com/fxsjy/jieba
import jieba

In [None]:
text = '姚明进入总决赛'
seg_list = jieba.cut(text)
print(", ".join(seg_list))

In [None]:
!python -m spacy download zh_core_web_sm

In [None]:
import spacy
nlp = spacy.load("zh_core_web_sm")
text = '姚明进入总决赛'
doc = nlp(text)
for token in doc: 
    print(token)

### Top-down (rule-based) tokenization - character tokenization

In [None]:
from spacy.lang.zh import Chinese
nlp_ch = Chinese()
print(*nlp_ch(text), sep='\n')

<h2 style="text-align: center;">Word Tokenization: BPE</h2>

### Byte-Pair Encoding: A Bottom-up Tokenization Algorithm
- It has been adopted by many modern LLMs, including the GPT series behind ChatGPT.
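
The sketch below shows the core learning loop of BPE on a toy word list: repeatedly count adjacent symbol pairs and merge the most frequent pair into a new symbol. It is a simplified illustration under stated assumptions (character-level symbols, no end-of-word marker, ties broken by first occurrence, made-up word counts) and is not how tiktoken is implemented; the function name `learn_bpe` is hypothetical.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy sketch)."""
    # Each distinct word becomes a tuple of symbols with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # first-seen pair wins ties
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Made-up corpus: each word repeated according to an assumed frequency.
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, vocab = learn_bpe(words, num_merges=3)
print(merges)
print(dict(vocab))
```

After three merges the frequent suffix "est" has become a single symbol, which is the bottom-up behavior described above: subword units emerge from pair statistics rather than from hand-written rules.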

In [None]:
!conda install conda-forge::tiktoken --yes

In [None]:
# First of all, install OpenAI's tiktoken via: conda install conda-forge::tiktoken
import tiktoken
# Load an encoding
encoding = tiktoken.get_encoding("cl100k_base")
# Use tiktoken.encoding_for_model() to automatically load the correct encoding for a given model name.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(encoding.encode("tiktoken is great!"))

In [None]:
# Count tokens by counting the length of the list returned by .encode().
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [None]:
text = "tiktoken is great!"
print(f'\"{text}\" has been encoded into {num_tokens_from_string(text, "cl100k_base")} subwords')

In [None]:
# .decode() converts a list of token integers to a string.
encode_ids = [83, 1609, 5963, 374, 2294, 0]
print(f'the decoded string is: \"{encoding.decode(encode_ids)}\"')

In [None]:
text = """
Chapters 5 to 8 teach the basics of 🤗 Datasets and 🤗 Tokenizers before diving into classic NLP tasks.\
By the end of this part, you will be able to tackle the most common NLP problems by yourself. \
By the end of this part, you will be ready to apply 🤗 Transformers to (almost) any machine \
learning problem! E=mc^2. f(x) = x^2+y^2, print('hello world!’) baojianzhou. asdasfasdgasdg
"""
print(encoding.encode(text))

In [None]:
encode_ids = encoding.encode(text)
print(encoding.decode(encode_ids))

<h2 style="text-align: center;">Word Normalization, Lemmatization and Stemming</h2>

### Lemmatization

- Lemmatization is the task of determining that two words have the same root, despite their surface differences.
- **Motivation**: In some NLP settings, we want two morphologically different forms of a word to behave similarly. For example, in web search, someone may type the string woodchucks, but a useful system might also want to return pages that mention woodchuck with no s.
- **Example 1**: The words am, are, and is have the shared lemma be.
- **Example 2**: The words dinner and dinners both have the lemma dinner.
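
The examples above can be sketched as a dictionary lookup. This is only a toy: the table `LEMMA_TABLE` and the helper `toy_lemma` are hypothetical and cover just the words mentioned here, whereas real lemmatizers rely on much larger lexicons, morphological rules, and part-of-speech information.

```python
# Hand-written lemma table covering only the examples in the list above.
LEMMA_TABLE = {
    "am": "be",
    "are": "be",
    "is": "be",
    "dinners": "dinner",
    "woodchucks": "woodchuck",
}

def toy_lemma(word):
    """Return the lemma from the table, or the lowercased word if unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

for w in ["am", "Are", "is", "dinner", "dinners", "woodchucks"]:
    print(w, "->", toy_lemma(w))
```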

In [None]:
import spacy
text = """
The Brown Corpus, a text corpus of American English that was compiled in the 1960s at Brown University, \
is widely used in the field of linguistics and natural language processing. It contains about 1 million \
words (or "tokens") across a diverse range of texts from 500 sources, categorized into 15 genres, such \
as news, editorial, and fiction, to provide a comprehensive resource for studying the English language. \
This corpus has been instrumental in the development and evaluation of various computational linguistics \
algorithms and tools.
"""
text = text.replace("\n", " ").strip()

In [None]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print(doc[0], type(doc[0]))

In [None]:
lemmas = [token.lemma_ for token in doc]
for ori,lemma in zip(doc[:10], lemmas[:10]):
    print(ori, lemma)

### Stemming: The Porter Stemmer method

Lemmatization algorithms can be complex. For this reason we sometimes use a simpler but cruder method, which mainly consists of chopping off word-final affixes. This naive version of morphological analysis is called stemming.
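
To give a feel for what "chopping off affixes" means, here is a deliberately crude suffix stripper. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the suffix list, the minimum stem length, and the name `crude_stem` are all assumptions made for this illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["soundings", "crosses", "copied", "notes", "was"]:
    print(w, crude_stem(w))
```

The results show both the appeal and the cost of the approach: "crosses" is reduced to "cross", but "notes" is cut down to "not", a different word entirely.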

In [None]:
# spacy does not provide stemming
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

text = """\
This was not the map we found in Billy Bones's chest, but \
an accurate copy, complete in all things-names and heights \
and soundings-with the single exception of the red crosses \
and the written notes.\
"""   
porter_stemmer = PorterStemmer()
words = word_tokenize(text)
for word in words[:10]:
    print(word, porter_stemmer.stem(word))

<h2 style="text-align: center;">Sentence Segmentation</h2>

In [None]:
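# Method 1: use the nltk package (install it first)
import nltk
# Download the required sentence tokenizer models
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead
from nltk.tokenize import sent_tokenize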

In [None]:
text = "In the first part of the book we introduce the fundamental suite of algorithmic \
tools that make up the modern neural language model that is the heart of end-to-end \
NLP systems. We begin with tokenization and preprocessing, as well as useful algorithms \
like computing edit distance, and then proceed to the tasks of classification, \
logistic regression, neural networks, proceeding through feedforward networks, recurrent \
networks, and then transformers. We’ll also see the role of embeddings as a \
model of word meaning."
sentences = sent_tokenize(text)
for ind, sent in enumerate(sentences):
    print(f"sentence-{ind}: {sent}\n")

In [None]:
# Method 2: A modern and fast NLP library that includes support for sentence segmentation. 
# spaCy uses a statistical model to predict sentence boundaries, which can be more accurate 
# than rule-based approaches for complex texts.
# Install via conda: conda install conda-forge::spacy
# Install via pip:   pip install -U spacy
# Download data: python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Here is a sentence. Here is another one! And the last one.")
sentences = [sent.text for sent in doc.sents]
for ind, sent in enumerate(sentences):
    print(f"sentence-{ind}: {sent}\n")
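
For contrast with the library-based methods, the sketch below is a naive rule-based splitter that breaks after ".", "!" or "?" whenever whitespace follows. The function `naive_sentences` and the second test sentence are made up for illustration; the point is to show where such a simple rule goes wrong.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (toy rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

simple = naive_sentences("Here is a sentence. Here is another one! And the last one.")
tricky = naive_sentences("Dr. Smith arrived at 5 p.m. sharp.")

print(simple)
print(tricky)
```

The rule handles the simple text, but it splits the second sentence after the abbreviations "Dr." and "p.m.", which is the kind of case that trained segmenters are designed to handle.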

In [None]:
# You need to install it via: python -m spacy download zh_core_web_sm
from spacy.lang.zh.examples import sentences 
nlp = spacy.load("zh_core_web_sm")
doc = nlp(sentences[0])
text = """\
时光荏苒，自 2003 年我师从吴立德教授，开启自然语言处理学习与研究之路，转眼已近二十\
载春秋。回想当年第一次听到自然语言处理的目标 ──“让机器理解人类语言”时的兴奋，第一次\
看到《大规模中文文本处理》教材时的茫然，仿佛黄萱菁教授对我研究生入学的电话面试就在昨\
天，每周与吴老师固定交流前的紧张感依然清晰。从求学到任教，深刻感受到自然语言处理的快\
速发展，从基于特征的统计机器学习方法到深度神经网络模型，再到大规模预训练方法，自然语\
言处理研究范式的更新迭代速度也在不断加快。在本科生和研究生的自然语言处理课程教学过程\
中，虽然通过不断补充国际国内的近期研究进展，将最新的理论和方法通过课件和面授的形式介\
绍给同学们，但是系统全面的书籍仍然是不可或缺的重要资料。于是，自 2020 年起与黄萱菁教授\
和桂韬研究员一起开始着手本书的准备，在经过几十次的讨论和大纲和结构反复修改后，自 2021\
年暑假起开始了本书的写作。2022 年本书入选复旦大学七大系列百本精品教材项目和复旦大学研\
究生规划系列教材项目，进一步督促我们加快进度。从规划到完成，历时近三年之久，这本拙作\
终于完成。"""
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]
for ind, sent in enumerate(sentences):
    print(f"sentence-{ind}: {sent}\n")