# Let's Tokenize Sentences

## But first, why tokens?
Short answer: API vendors bill by usage, measured in tokens (prices are usually quoted per thousand or per million tokens).
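
As a rough illustration of per-token billing, the sketch below turns a token count into a cost. The price used here is hypothetical, not any vendor's actual rate, and `estimate_cost` is a helper name made up for this example.

```python
def estimate_cost(n_tokens, usd_per_million_tokens):
    """Return the cost in USD for n_tokens at a given per-million-token price."""
    return n_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical price: 2.00 USD per million tokens (an assumption for illustration).
cost = estimate_cost(250_000, 2.00)
print(f"${cost:.2f}")  # $0.50
```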

Long answer: Language models like BERT and GPT count and process text in tokens rather than words for a few key reasons:
1. Words don't capture morphology - Tokens let a model represent prefixes, suffixes, and inflected forms as reusable pieces instead of storing every word form separately. This improves generalization.
1. Non-word tokens are useful - Tokens can include punctuation, special symbols, numbers, etc., which carry meaning.
1. Subword tokenization - Splitting rare words into subword units with a scheme like WordPiece improves vocabulary coverage.
1. Vocabulary size limits - Models have a fixed vocabulary size, bounded by memory constraints. Subword tokens cover more text within that fixed budget.
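
To make the subword idea concrete, here is a minimal sketch of greedy longest-match-first splitting in the style of WordPiece. The vocabulary is a tiny made-up set and `wordpiece_tokenize` is a hypothetical helper; real models ship a learned vocabulary with tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # "##" marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, assumed for illustration only.
toy_vocab = {"token", "##ize", "##izer", "##s"}

print(wordpiece_tokenize("tokenize", toy_vocab))    # ['token', '##ize']
print(wordpiece_tokenize("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

With only four entries, the toy vocabulary still covers "tokenize", "tokenizer", "tokenizers", and "tokens", which is the coverage benefit described in the list above.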

## spaCy cheatsheet
https://www.datacamp.com/cheat-sheet/spacy-cheat-sheet-advanced-nlp-in-python

## How to tokenize sentences
First, install spaCy, and then the English language model:

In [None]:
# python -m pip install spacy
# python -m spacy download en_core_web_sm

import spacy
import json
import pprint

# https://spacy.io/api/top-level#spacy.prefer_gpu
# Allocate data and perform operations on GPU, if available. 
# If data has already been allocated on CPU, it will not be moved. 
# Ideally, this function should be called right after importing spaCy and before loading any pipelines.
spacy.prefer_gpu()

# pprint.pprint(spacy.info())
print(json.dumps(spacy.info(), indent=4))

nlp = spacy.load("en_core_web_sm")

doc = nlp(
    "This is the first sentence that you will tokenize."
)
# A bare `type(doc)` in the middle of a cell displays nothing, so print it.
print(type(doc))
[token.text for token in doc]


output:
```
[
'This',
'is',
'the',
'first',
'sentence',
'that',
'you',
'will',
'tokenize',
'.'
]
```    
9 words and 10 tokens (the final period is its own token), but 1 doc.

In [None]:
doc = nlp(
    "This shouldn't have been the first sentence that you will tokenize. Better if wasn't third, also."
)
print(type(doc))
# len(doc) is the number of tokens in the doc, not the number of docs.
print(f"{len(doc)} tokens")
[token.text for token in doc]

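16 words, but 21 tokens: each contraction is split in two ("shouldn't" becomes "should" + "n't") and every punctuation mark is its own token.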

```
[
 'This',
 'should',
 "n't",
 'have',
 'been',
 'the',
 'first',
 'sentence',
 'that',
 'you',
 'will',
 'tokenize',
 '.',
 'Better',
 'if',
 'was',
 "n't",
 'third',
 ',',
 'also',
 '.'
 ]
 ```
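
The gap between word count and token count can be reproduced with a toy regex tokenizer. This is only a sketch of the idea, not spaCy's actual tokenizer (which uses language-specific rules and exceptions); the pattern and the `toy_tokenize` helper are assumptions made for this example.

```python
import re

# Toy pattern, NOT spaCy's rule set: split "n't" off the preceding word,
# keep other word runs whole, and make each punctuation mark its own token.
TOKEN_PATTERN = re.compile(r"\w+(?=n't)|n't|\w+|[^\w\s]")


def toy_tokenize(text):
    """Return a list of token strings for a plain English sentence."""
    return TOKEN_PATTERN.findall(text)


text = (
    "This shouldn't have been the first sentence that you will tokenize. "
    "Better if wasn't third, also."
)
words = text.split()          # whitespace-separated words
tokens = toy_tokenize(text)   # contractions and punctuation split out

print(len(words), "words;", len(tokens), "tokens")  # 16 words; 21 tokens
print(tokens[:4])  # ['This', 'should', "n't", 'have']
```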

In [None]:
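import pathlib

# Assumes a text file named wiki_spacy.txt exists in the working directory.
file_name = "wiki_spacy.txt"
doc = nlp(pathlib.Path(file_name).read_text(encoding="utf-8"))

print([token.text for token in doc])

# sentences
sentences = list(doc.sents)
print(len(sentences))

# Show the first five tokens of each sentence.
for sentence in sentences:
    print(f"{sentence[:5]}...")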


## Language detection



In [None]:
# python -m pip install spacy-langdetect

import spacy
from spacy.language import Language
from spacy_langdetect import LanguageDetector

# Ideally, this function should be called right after importing spaCy and before loading any pipelines.
spacy.prefer_gpu()

# The decorator registers the factory; no second Language.factory(...) call is needed.
@Language.factory("language_detector")
def get_lang_detector(nlp, name):
    return LanguageDetector()

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("language_detector", last=True)


In [None]:
unknown_language = "Er lebt mit seinen Eltern und seiner Schwester in Berlin."
doc = nlp(unknown_language)
# doc._.language is a dict with 'language' and 'score' keys.
detect_language = doc._.language
print(detect_language)

unknown_language = '왜 내 주변에는 맛있는 한식당이 없지?'
doc = nlp(unknown_language)
detect_language = doc._.language
print(detect_language)

# tokens of the last doc (the Korean sentence)
print([token.text for token in doc])