## Speech and Language Processing

### An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition

Daniel Jurafsky and James H. Martin. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (1st. ed.). Prentice Hall PTR, USA.

### Tokenization

Tokenization is a fundamental step in Natural Language Processing (NLP): it segments text into discrete units called tokens. The goal is to break the text into smaller, meaningful units that downstream NLP algorithms can process. Tokens are typically words, but they can also be phrases, sentences, or even individual characters, depending on the granularity a particular application requires.
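
To make the idea of granularity concrete, here is a minimal sketch that splits the same string at the sentence, word, and character level using only the Python standard library. The sample string and the naive sentence-splitting regex are assumptions chosen for illustration, not a robust sentence segmenter.

```python
import re

sample = "I am unhappy. I need some help."

# Sentence-level tokens: naive split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", sample)

# Word-level tokens: plain whitespace split (punctuation stays attached)
words = sample.split()

# Character-level tokens: every non-space character is its own token
chars = [c for c in sample if not c.isspace()]

print(sentences)
print(words)
print(chars[:5])
```

The same text yields 2, 7, or 25 tokens depending on the level chosen, which is why the granularity has to be decided before any further processing.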

There are several approaches to tokenization, and the right one depends on the requirements of the task at hand. One common method is to split text into tokens on whitespace or punctuation. This works well for many applications, but it breaks down for languages that do not use spaces to separate words (e.g., Chinese or Japanese).
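
The rule-based approach described above can be sketched with a single regular expression. The function name `regex_tokenize` and the Chinese sample sentence (roughly "I need some help") are assumptions introduced for illustration; this is a simple pattern, not the rule set NLTK uses.

```python
import re

def regex_tokenize(text):
    # A run of word characters is one token;
    # any other non-space character is a token on its own.
    return re.findall(r"\w+|[^\w\s]", text)

english = "I need some help, that much seems certain."
print(regex_tokenize(english))

# Text written without spaces: a whitespace split returns one long "word".
chinese = "我需要一些帮助"
print(chinese.split())
```

For the English sentence the comma and period come out as separate tokens. For the Chinese sentence both `split()` and the regex return the whole string as a single token, which is the failure case mentioned above.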

Another approach is to learn word boundaries from data. Statistical sequence models such as Hidden Markov Models or Conditional Random Fields are trained to recognize patterns in text that signal where a word or other linguistic unit begins and ends. This approach is typically more accurate than simple whitespace- or punctuation-based tokenization, particularly for languages without explicit word separators, but it usually needs training data and more computational resources, so it may not suit every application.
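
One common way to frame this for a sequence model is to label each character as beginning a token (`B`) or continuing one (`I`), then rebuild the tokens from the labels. The sketch below shows only the decoding step. The tag sequence is written by hand as a stand-in for the output of a trained HMM or CRF, and `decode_bi_tags` is a hypothetical helper name, not a library function.

```python
def decode_bi_tags(chars, tags):
    """Group characters into tokens: 'B' starts a new token, 'I' extends it."""
    tokens = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not tokens:
            tokens.append(ch)
        else:
            tokens[-1] += ch
    return tokens

chars = list("我需要一些帮助")
# Hypothetical tagger output, written by hand for illustration only.
tags = ["B", "B", "I", "B", "I", "B", "I"]

print(decode_bi_tags(chars, tags))
```

With these labels the seven characters are grouped into four tokens. In a real system the labels would come from the trained model rather than being typed in.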

Once text has been tokenized, the resulting tokens can be passed to NLP algorithms for tasks such as part-of-speech tagging, named entity recognition, or sentiment analysis. Tokenization is a critical first step in these tasks because errors made here propagate: the accuracy of every later stage depends on the quality of the initial tokenization. It is therefore important to choose a tokenization approach that fits both the task and the language being processed.

In [4]:
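import re
import nltk
from nltk.tokenize import word_tokenize

# Punkt models used by word_tokenize
# (newer NLTK releases may also require 'punkt_tab')
nltk.download('punkt')
# WordNet data; not needed for tokenization itself
nltk.download('wordnet')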

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\vgama\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\vgama\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [5]:
text = """ 
        User: I am unhappy. 
        ELIZA: DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
        User: I need some help, that much seems certain.
        ELIZA: WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
        User: Perhaps I could learn to get along with my mother.
        ELIZA: TELL ME MORE ABOUT YOUR FAMILY
        User: My mother takes care of me.
        ELIZA: WHO ELSE IN YOU FAMILY TAKES CARE OF YOU
        User: My father.
        ELIZA: YOUR FATHER
        User: You are like my father in some ways.
"""

In [8]:
# Lowercase the text, then tokenize: punctuation marks become separate tokens
word_tokenize(text.lower())

['user',
 ':',
 'i',
 'am',
 'unhappy',
 '.',
 'eliza',
 ':',
 'do',
 'you',
 'think',
 'coming',
 'here',
 'will',
 'help',
 'you',
 'not',
 'to',
 'be',
 'unhappy',
 'user',
 ':',
 'i',
 'need',
 'some',
 'help',
 ',',
 'that',
 'much',
 'seems',
 'certain',
 '.',
 'eliza',
 ':',
 'what',
 'would',
 'it',
 'mean',
 'to',
 'you',
 'if',
 'you',
 'got',
 'some',
 'help',
 'user',
 ':',
 'perhaps',
 'i',
 'could',
 'learn',
 'to',
 'get',
 'along',
 'with',
 'my',
 'mother',
 '.',
 'eliza',
 ':',
 'tell',
 'me',
 'more',
 'about',
 'your',
 'family',
 'user',
 ':',
 'my',
 'mother',
 'takes',
 'care',
 'of',
 'me',
 '.',
 'eliza',
 ':',
 'who',
 'else',
 'in',
 'you',
 'family',
 'takes',
 'care',
 'of',
 'you',
 'user',
 ':',
 'my',
 'father',
 '.',
 'eliza',
 ':',
 'your',
 'father',
 'user',
 ':',
 'you',
 'are',
 'like',
 'my',
 'father',
 'in',
 'some',
 'ways',
 '.']

In [10]:
# Plain whitespace split for comparison: punctuation stays attached to words
text.lower().split()

['user:',
 'i',
 'am',
 'unhappy.',
 'eliza:',
 'do',
 'you',
 'think',
 'coming',
 'here',
 'will',
 'help',
 'you',
 'not',
 'to',
 'be',
 'unhappy',
 'user:',
 'i',
 'need',
 'some',
 'help,',
 'that',
 'much',
 'seems',
 'certain.',
 'eliza:',
 'what',
 'would',
 'it',
 'mean',
 'to',
 'you',
 'if',
 'you',
 'got',
 'some',
 'help',
 'user:',
 'perhaps',
 'i',
 'could',
 'learn',
 'to',
 'get',
 'along',
 'with',
 'my',
 'mother.',
 'eliza:',
 'tell',
 'me',
 'more',
 'about',
 'your',
 'family',
 'user:',
 'my',
 'mother',
 'takes',
 'care',
 'of',
 'me.',
 'eliza:',
 'who',
 'else',
 'in',
 'you',
 'family',
 'takes',
 'care',
 'of',
 'you',
 'user:',
 'my',
 'father.',
 'eliza:',
 'your',
 'father',
 'user:',
 'you',
 'are',
 'like',
 'my',
 'father',
 'in',
 'some',
 'ways.']
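
The two outputs above differ only in how punctuation is handled: `word_tokenize` emits `':'`, `','`, and `'.'` as separate tokens, while `split()` leaves them glued to the neighbouring word (`'user:'`, `'help,'`, `'certain.'`). As a rough sketch, the helper below (a hypothetical name, not part of NLTK) lists the whitespace-split tokens that still carry trailing punctuation, using one lowercased line of the dialogue as input.

```python
import string

def attached_punctuation(tokens):
    """Return tokens longer than one character that end in a punctuation mark."""
    return [t for t in tokens if len(t) > 1 and t[-1] in string.punctuation]

line = "user: i need some help, that much seems certain."
print(attached_punctuation(line.split()))
```

Tokens such as `'help,'` and `'help'` would be counted as different vocabulary items, which is one practical reason to prefer a tokenizer that separates punctuation.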