In [None]:
import re

## Lecture One
mainly following the Speech and Language Processing book
- Tokenization
- Regular Expressions
- Edit Distance
- BPE, what was before BPE?? what's the BPE paper? include it here

### What is text?
At the lowest level, text is a string or stream of characters (or bytes).

e.g. it could be an html page.

```
"<div class="md"><p>Depends on who it's for I guess? Gender? Age? Nationality? Might help people give you ideas. In general, I'd say something tartan maybe? House of Tartan sells blankets and scarves etc (along with lots of other generic Scottish stuff). If your gifts are for artsy people then you could maybe get some Charles Rennie Mackintosh related gifts (the shop in the Lighthouse sells loads of stuff like that - might be worth having a wander around there in general).</p></div>"
```

Some of the characters are markup; we can use an HTML parser to remove the tags and focus on the main text. This is easier for JSON, HTML, and XML, and harder for formats such as PDF.

Aboutness is a key concept in text analytics, e.g. identifying documents relevant to a search request, or classifying a document by its content.

What is not important?
- filler words: on, who, are, I, be, a, there, you, etc.
- other words, depending on the context: general, something, maybe, ...


Text processing as a pipeline:
- html
- ascii
- text
- vocab

```python
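from urllib.request import urlopen

import nltk
from bs4 import BeautifulSoup

# get the text and trim it to the desired content
# (`url` and the slice offsets below are placeholders that depend on the page)
html = urlopen(url).read()
# nltk.clean_html() no longer works in NLTK 3; BeautifulSoup's get_text() strips the tags instead
raw = BeautifulSoup(html, "html.parser").get_text()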
raw = raw[750:23506]

# get the tokens of interest from the raw text
tokens = nltk.wordpunct_tokenize(raw)
tokens = tokens[20:1834]
text = nltk.Text(tokens)


words = [w.lower() for w in text] # normalization
vocab = sorted(set(words)) # unique, and sorted (do we need it sorted??)
```

Most tasks in text processing require text normalization.

1. Segmenting/tokenizing terms in running text. e.g.
```
"House of Tartan sells blankets" -> "House", "of", "Tartan", "sells", "blankets"
```

- splitting on whitespace: `" " TAB \n \r`
- splitting on punctuation: `_-.?!,;:"()'&£$`

The above is accomplished via regex, i.e. splitting.

Other languages can be harder to tokenize, e.g. Japanese, which is written without spaces between words.
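
As a rough sketch of the two splitting strategies above, the snippet below uses Python's `re` module. The sample sentence and both patterns are illustrative assumptions, not a complete tokenizer.

```python
import re

sentence = "House of Tartan sells blankets, scarves etc."

# split on runs of whitespace (spaces, tabs, newlines); punctuation stays glued to words
whitespace_tokens = re.split(r"\s+", sentence.strip())

# keep runs of word characters, and emit every punctuation mark as its own token
punctuation_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(punctuation_tokens)
```

Whitespace splitting leaves `blankets,` and `etc.` with their punctuation attached, while the second pattern separates them; which one is preferable depends on the task.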

2. Normalize or 'canonicalize' tokens into a normal form.
We may "case fold" (i.e. lowercase) the text, stem words, remove numbers and punctuation, etc.
e.g.
```
"House", "of", "Tartan", "sells", "blankets"
"house", "of", "tartan", "sell", "blanket"
```

3. Segment long sequences of tokens (usually into sentences).
For this we can build a binary classifier (later) and/or use heuristic rules.
```
e.g. [Might, help, people, give, you, ideas, ., I'd, say, something, tartan, maybe, ?]
-> <s>Might help people give you ideas.</s><s>I'd say something tartan maybe?</s>
```
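
A heuristic rule for step 3 could look like the sketch below: end a sentence at `.`, `?` or `!` unless the token is a known abbreviation. The `ABBREVIATIONS` set and the `split_sentences` helper are hypothetical names introduced for illustration, and the abbreviation list is far from complete.

```python
# tiny illustrative abbreviation list (an assumption, not a real resource)
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc.", "e.g.", "i.e.", "ph.d."}

def split_sentences(text):
    """Split text into sentences with a simple punctuation heuristic."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # a sentence ends at . ? ! unless the token is a known abbreviation
        if token[-1] in ".?!" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that has no closing punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Might help people give you ideas. I'd say something tartan maybe?"))
print(split_sentences("Dr. Smith sells blankets etc. in Glasgow. Nice!"))
```

A rule like this fails whenever an abbreviation really does end a sentence, which is one reason a trained classifier is often used instead.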

##### Text Processing Summary
Tokenization is easy but hard to do well.

Tokenization complications:
- Contractions: what're
- Numbers: 555,555.00
- Multiword expressions: rock 'n' roll
- Useful punctuation: m.p.h., Ph.D., AT&T
- URLs: http://www.google.com
- Runaway tokens: #mliscoolweneedtolearnitnow

The Penn Treebank defines one common tokenization standard used in NLP tasks.

Text normalization (and usually stemming) is almost always required.

Tokenization and normalization are language specific.


### Regular Expressions, Tokenization, Edit Distance (maybe/?)
ELIZA was an early chatbot-like program based on pattern matching: it recognized phrases like "I need X" and translated them into suitable outputs like "What would you do if you got X?". It didn't know anything about the world; it mimicked a listener who acts as if they know almost nothing.

One tool for describing text patterns is the **regular expression**, used for example to extract strings from a document.
**text normalization** means converting text to a more convenient, standard form.
**tokenization** means separating words or word parts from the text, e.g. English words are often separated by whitespace (which is not always sufficient).
For tweets, for example, we'd need to tokenize **emoticons** like :) or **hashtags** like #nlp.
Some languages are hard to tokenize because they don't put spaces between words, e.g. Japanese.
For large language models we sometimes need to tokenize into subwords, short phrases, or letters.
Another part of text normalization is **lemmatization**, the task of determining that words have the same root, e.g. *sang*, *sung*, *sings* are forms of the verb *sing*. *sing* is the common *lemma* of these words. A **lemmatizer** maps all of these to *sing*.
**stemming** is a simpler form of lemmatization in which we just strip suffixes from the end of the word.
Text normalization also includes **sentence segmentation**, breaking text into sentences using cues like periods or exclamation marks.
The metric **edit distance** measures how similar two strings are, based on the number of edits (insertions, deletions, substitutions) it takes to change one string into the other.

#### Regular Expressions
regex is a language for specifying text search strings, used e.g. in `grep` and `vim`.
it is very useful when we have a **pattern** to search for and a **corpus** to search through.
a regex search goes through the corpus and returns all the matches of the pattern.
the corpus can be a single document or a collection.
we'll describe the **extended regular expressions**.

regular expression patterns are case sensitive. the basic operators are:
- concatenation: putting characters in sequence, e.g. `/woodchucks/`, `/ubaid/`
- range: inside square brackets, a dash `-` specifies a range of characters, e.g. `/[A-Z]/`, `/[0-9]/`, `/[a-z]/`
- Kleene*: `*` means zero or more occurrences of the preceding character or expression
- Kleene+: `+` means one or more occurrences
- wildcard: `.` matches any one character
- anchors: anchor the regular expressions to a particular place in a string
  - `^`   start of line
  - `$`   end of line
  - `\b`  word boundary
  - `\B`  non-word boundary
- **disjunction** operator: `|` specifies either or, e.g. `dog|cat`
- grouping: enclosing a sequence in `()` makes it behave like a single character for operators such as `|` and `*`

regular expressions are **greedy** by default (they match as much as possible), but the **non-greedy** Kleene operators `*?` and `+?` match as little as possible.
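
To make the greedy versus non-greedy distinction concrete, here is a small sketch with Python's `re` module; the sample HTML string is made up for illustration.

```python
import re

html = "<p>Depends on who it's for</p>"

# greedy: .+ grabs as much as it can, so the match runs from the first < to the last >
greedy = re.findall(r"<.+>", html)

# non-greedy: .+? stops at the first > it can, so each tag is matched separately
non_greedy = re.findall(r"<.+?>", html)

print(greedy)
print(non_greedy)
```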

an example is trying to match every occurrence of the word `the` in a document.
we start with `/the/`, but that misses `The` at the beginning of a sentence,
so we use `/[Tt]he/`,
but that also matches the `the` inside `other` and `there`,
so we add word boundaries: `/\b[Tt]he\b/`


the process above fixed two kinds of errors: **false positives**, i.e. incorrectly matching strings like `other` or `there`,
and **false negatives**, i.e. missing correct strings like `The`.

reducing the overall error of an application involves two antagonistic efforts:
- increasing **precision** (minimizing false positives)
- increasing **recall** (minimizing false negatives)
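
The refinement of the `the` pattern can be checked directly in Python. This is a minimal sketch; the sample text is invented so that it contains both a false negative (`The`) and false positives (`other`, `there`).

```python
import re

text = "The other one is there, not the cat."

# /the/ misses "The" and also fires inside "other" and "there"
first_try = re.findall(r"the", text)

# /[Tt]he/ recovers "The" (better recall) but still matches inside other words
second_try = re.findall(r"[Tt]he", text)

# /\b[Tt]he\b/ requires word boundaries on both sides (better precision)
final_try = re.findall(r"\b[Tt]he\b", text)

print(first_try, second_try, final_try)
```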


an important use of regular expressions is in **substitutions**, e.g. the substitution operator `s/colour/color/` in vim, or `re.sub` in python.

you can use the number operator `\1` to refer back to the text matched by a parenthesized part of the pattern.
this use of parentheses to store a pattern in memory is called a **capture group**. every time a capture group is used, the resulting match is stored in a numbered **register**, and you refer to the captured groups by number: `\1`, `\2`, `\3`, etc.

if we want parentheses for grouping without capturing, we can use a **non-capturing group** `(?: )`, as in `/(?:some|a few) (people|cat) like some \1/`
in the example above `\1` refers to `(people|cat)`: it is the second parenthesized group but the first capturing one.
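
A short sketch of back-references in Python's `re` module follows. The sample strings are chosen for illustration, and the group in the last pattern uses `cats` rather than `cat` so that the test sentence reads naturally.

```python
import re

# in a replacement string, \1 is the text captured by the first group
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")

# inside a pattern, \1 forces the same text to appear again
comparative = r"the (.*)er they were, the \1er they will be"
same = re.search(comparative, "the bigger they were, the bigger they will be")
different = re.search(comparative, "the bigger they were, the faster they will be")

# (?:...) groups without capturing, so \1 refers to (people|cats)
non_capturing = r"(?:some|a few) (people|cats) like some \1"
repeated_noun = re.search(non_capturing, "some cats like some cats")
repeated_quantifier = re.search(non_capturing, "some cats like some some")

print(bracketed, bool(same), bool(different), bool(repeated_noun), bool(repeated_quantifier))
```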

Recall ELIZA, which works through a cascade of regular expression substitutions.
Some of them are:
```txt
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
```
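
The cascade above can be sketched in Python as an ordered list of rules where the first match wins. This is only an approximation of ELIZA: the `RULES` list, the `eliza_reply` helper, the pronoun-swapping first pass, and the `PLEASE GO ON` fallback are assumptions made for this example.

```python
import re

# (pattern, response template) pairs, tried in order; the first match wins
RULES = [
    (r".*\bYOU ARE (DEPRESSED|SAD)\b.*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".*\bALL\b.*", "IN WHAT WAY"),
    (r".*\bALWAYS\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def eliza_reply(utterance):
    text = utterance.upper()
    # first pass: swap the point of view, so "I AM" becomes "YOU ARE"
    text = re.sub(r"\bI AM\b|\bI'M\b", "YOU ARE", text)
    for pattern, template in RULES:
        match = re.fullmatch(pattern, text)
        if match:
            # expand() fills \1 in the template with the captured group
            return match.expand(template)
    return "PLEASE GO ON"  # hypothetical fallback when no rule applies

print(eliza_reply("I'm depressed lately"))
print(eliza_reply("They are all alike"))
```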



In [None]:
document = "Thethis * Column 1      Column 3 Column 32423 dogy begin other cat the begun w weeee is 223234 our document we'll search through to find patterns, this is random, will we find anything useful?? We 2"
concatenation_pattern = "we"
sensitive = "We"
sensitive_solved = "[Ww]e"
any_single_digit = "[1234567890]"
range_pattern = "[0-9]" # similar to the above one

# inside [] a leading ^ negates the set: the character cannot be any of those listed
cannot_be_pattern = "[^a]" # matches any single character except the character a

# we can use the ? to indicate the preceding or nothing, i.e. zero or one instance
preceding_or_nothing = "we?"

"""
/[^A-Z]/    not an upper case letter
/[^Ss]/     one character neither S or s
/[^.]/      not a period
/[e^]/      either e or ^
/a^b/       match this pattern specifically a^b
/colou?r/   color or colour
"""

# zero or more occurrences are defined using the Kleene*
zero_or_more = "we*"
# can also do zero or more of complex patterns
zero_or_more_complex = "[ab]*"
# to specify at least one we can use the plus +
at_least_one_pattern = "we+"
another_least_one_pattern = "[0-9]+" # at least a digit

wildcard_character_pattern = "beg.n"

# ^ anchor for beginning of line
beginning_of_line_pattern = "^The"

boundary_word_pattern = r"\bthe" # raw string: in a plain string "\b" is a backspace character, not a word boundary

either_or_pattern =  "dog|cat"

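# non-capturing groups (?:...) so that re.findall returns the whole match rather than the group
act_like_single_character = "dog(?:ggy|y)"

columns_pattern = "(?:Column [0-9]+ *)*"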


# exactly this many occurrences {} and can also specify ranges with {m,n}
exactly_this_many_occurrences = "[Ww]e{4}"
range_of_occurrences_pattern = "[Ww]e{1,4}"

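# special characters are written with a backslash, e.g. `\n` for a newline; a backslash also escapes operators, e.g. /\*/ matches a literal asterisk
newline_pattern = "\n"
asterisk_pattern = r"\*" # raw string, so the backslash reaches the regex engine unchanged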


found_patterns = re.findall(concatenation_pattern, document)
print(found_patterns)
sensitive_matches = re.findall(sensitive, document)
print(sensitive_matches)
sensitive_solved_matches = re.findall(sensitive_solved, document)
print(sensitive_solved_matches)
digit_matches = re.findall(any_single_digit, document)
print(digit_matches)
range_matches = re.findall(range_pattern, document)
print(range_matches)
anything_but_a_matches = re.findall(cannot_be_pattern, document)
print(anything_but_a_matches)
preceding_or_nothing_matches = re.findall(preceding_or_nothing, document)
print(preceding_or_nothing_matches)
zero_or_more_matches = re.findall(zero_or_more, document)
print(zero_or_more_matches)
zero_or_more_complex_matches = re.findall(zero_or_more_complex, document)
print(zero_or_more_complex_matches)
at_least_one_pattern_matches = re.findall(at_least_one_pattern, document)
print(at_least_one_pattern_matches)
at_least_a_digit_matches = re.findall(another_least_one_pattern, document)
print(at_least_a_digit_matches)
any_one_character_wildcard_matches = re.findall(wildcard_character_pattern, document)
print(any_one_character_wildcard_matches)
beginning_of_line_matches = re.findall(beginning_of_line_pattern, document)
print(beginning_of_line_matches)
boundary_word_matches = re.findall(boundary_word_pattern, document)
print(f"boundary_word_matches {boundary_word_matches}")
either_or_matches = re.findall(either_or_pattern, document)
print(either_or_matches)
precedence_matches = re.findall(act_like_single_character, document) # note: when a pattern contains a capturing group, re.findall returns the group's text, not the whole match
print(precedence_matches)
column_matches = re.findall(columns_pattern, document)
print(column_matches)
exactly_this_many_occurrences_matches = re.findall(exactly_this_many_occurrences, document)
print(exactly_this_many_occurrences_matches)
range_of_occurrences_matches = re.findall(range_of_occurrences_pattern, document)
print(range_of_occurrences_matches)
asterisk_matches = re.findall(asterisk_pattern, document)
print(asterisk_matches)

In [None]:
# NOTE this could be given as an exercise

# more complex example
# in Python `$` on its own is the end-of-line anchor, so a literal dollar sign is escaped as `\$`
initial_pattern = r"\$[0-9]+" # a dollar sign followed by digits to get the price
# now let's add a decimal point and two digits afterwards
initial_pattern = r"\$[0-9]+\.[0-9]{2}" # the book writes [0-9][0-9] instead of [0-9]{2}; the two are equivalent
# the above pattern matches $199.22 but not $199
initial_pattern = r"\$[0-9]+(\.[0-9]{2})?" # remember `?` means zero or one
# 
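
The price pattern can be tried out as in the sketch below. In Python's `re`, an unescaped `$` is the end-of-line anchor, so the literal dollar sign is written `\$`; the sample sentence is made up, and the optional cents group is non-capturing so that `re.findall` returns whole matches.

```python
import re

text = "It costs $199.22 today, or $199 on sale."

# unescaped $ anchors at the end of the string, so no digits can follow it
unescaped = re.findall(r"$[0-9]+", text)

# escaped dollar sign, digits, and an optional two-digit cents part
price_pattern = r"\$[0-9]+(?:\.[0-9]{2})?"
prices = re.findall(price_pattern, text)

print(unescaped)
print(prices)
```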


#### Words
what counts as a word? we can decide to treat punctuation as separate words or not depending on the task, e.g. for part-of-speech tagging punctuation is useful.
an **utterance** is the spoken correlate of a sentence.
e.g. "I do uh main- mainly business data processing"
this utterance has two kinds of **disfluencies**: the broken-off word `main-` is called a **fragment**,
while words like uh and um are called **fillers** or **filled pauses**. should we consider these as words? it depends on the application.

to understand better what counts as a word we need to distinguish **word types**, the distinct words in the corpus (their number is the vocabulary size |V|),
from word **instances**, the total number N of running words (what used to be called word tokens).

do we consider `They` and `they` as two word types or the same? it depends on the task, e.g. for speech recognition treating them as the same is fine.

the relationship between the number of word types |V| and the number of word instances N is called **Herdan's law** or **Heaps' law**: |V| grows roughly as a power of N with an exponent below 1, so the vocabulary keeps growing with the corpus, but more slowly than the corpus itself.
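
The difference between types and instances is easy to check on a toy sentence. The sketch below assumes a very crude tokenizer (`\w+`) and an invented sentence; it also shows how case folding changes the number of types.

```python
import re

text = "They said they saw the cat and the dog."

tokens = re.findall(r"\w+", text)                # crude tokenizer: runs of word characters
instances = len(tokens)                          # N: running words
types = len(set(tokens))                         # |V|: distinct words, case sensitive
folded_types = len({t.lower() for t in tokens})  # |V| after case folding

print(instances, types, folded_types)
```

Here `the` is counted once as a type but twice as an instance, and `They`/`they` collapse into one type only after case folding.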

cats and cat are two different **wordforms** but have the same **lemma**. a lemma is a set of lexical forms having the same stem. the **wordform** is the full inflected or derived form of the word.

many LLMs don't operate on words directly; they use **tokens** produced by a **tokenization** process. a token can be a word or a part of a word.

#### Corpora
text comes in many genres, e.g. telephone conversations, business meetings, medical interviews, etc.
to understand what a corpus was meant for, we can rely on its **datasheet** or **data statement**, which includes:
- motivation for collecting the corpus
- situation: when and in what situation was the text written/spoken?
- language variety: what language was the corpus in?
- speaker demographics: what was e.g. the age or sex of the text's authors?
- collection process: how big is the data? if it is a subsample how was it sampled? was the data collected with consent? how was the data preprocessed? and what metadata is available?
- annotation process: what are the annotations, and how and by whom was the data annotated?
- distribution: are there copyright or other intellectual property restrictions?


#### Text Normalization
before any natural language processing of a text, the text has to be normalized through the **text normalization** process, which involves:
- tokenizing (segmenting) words
- normalizing word formats
- segmenting sentences


NOTE: ok at this point we can start collecting the data and do some of text normalization process in a jupyter notebook.


**NOTE what happens when you have a corpus with lots of examples of one label??? we need to explain it in our lectures**

**the other thing we need to deal with the preprocessing or processing of our corpus for our task is the handling of links, we could potentially neglect them, i.e. remove them with a regular expression so we basically showcase the use of how regex are used, we could do that in one type of classification or in the other one we could just keep it or replace with a placeholder like [LINK], we need to keep the script so that it has both versions, for now we don't do anything special.**

- We should compare the performance of the different solutions mentioned above.
- Do feature engineering.
- Create synthetic data after training the model.
- We could also have one model trained with the subject as one of the features.
- We've also included the has_attachment feature and the subject.
- Ideally, when we create the synthetic data we also want to be able to generate these features, so that in the lecture descriptions we can explain what the original data looked like.
- We should train with the original capitalization and with lowercase only, and show the difference.

**how do we deal with the fact that our corpus doesn't have all the words that exists in english vocabulary???**




after having run `../scripts/process_linkedin_messages.py`, we now try to apply the lessons from the first lecture to process the text.

Let's first create a corpus by combining all the messages into one large document, then extract all the words in the document to form the vocabulary.

The first thing we need to do is to collapse runs of spaces in the individual messages into a single space. The messages contain runs of spaces because of the way we collected the data and put it into `scripts/messages.py`.


By using the pattern `\S+` we are defining and implementing a simple version of **tokenization**.
For now, we count as a word any run of consecutive non-whitespace characters.

NOTE: we haven't preprocessed the text in any way, i.e. we haven't lowercased anything, taken the lemmas, etc..


The below code is a simple example of a **token learner**.



In [None]:
from collections import defaultdict

messages_dataset_filepath = "../datasets/messages.csv"

def simple_token_learner():
  words = defaultdict(int)
  with open(messages_dataset_filepath, "r") as f:
    any_non_white_space_character_pattern = r"\S+" # raw string avoids Python's invalid-escape warning for "\S"
    lines = f.readlines()

    #print(f"lines={lines}")
    for i, line in enumerate(lines):
      #print(f"line={line}")
      if i == 0: continue # skip the first csv header row
      block = int(line[0])
      content = line[2:]
      #print(f"type of block is {type(block)}") 
      #print(f"content={content}")
      any_non_white_space_matches = re.findall(any_non_white_space_character_pattern, content)
      for word in any_non_white_space_matches:
        words[word] += 1
      #print(f"words={any_non_white_space_matches}")

  # sort the words by the most frequent one first
  words = dict(sorted(words.items(), key=lambda item: item[1], reverse=True))
  for word, freq in words.items():
    print(f"word={word} {freq}")
  #print(words)
  print(f"total words are {len(words)}")
  print(f"Ubaidullah, freq = {words['Ubaidullah,']}")

  return words



words = simple_token_learner()

#### Word and Subword Tokenization
In NLP, we usually break text into **subword tokens**, which can be words, parts of words, or individual letters.

Tokenization is run before any other language processing.

**Top Down (Rule Based) Tokenization**
In NLP, we usually keep punctuation and numbers. We then need to account for hashtags, URLs, emails, dates, special characters inside words like AT&T, prices (e.g. $45.45), etc.
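
A rule-based tokenizer of this kind is often written as one regular expression with an alternative per token class. The sketch below is an illustration under assumptions: `TOKEN_PATTERN` and its sub-patterns are invented for this example and cover only a few of the cases listed above.

```python
import re

# alternatives are tried left to right, so the more specific classes come first
TOKEN_PATTERN = re.compile(r"""
    https?://\S+                # URLs
  | [\w.+-]+@[\w-]+\.[\w.-]+    # e-mail addresses
  | \#\w+                       # hashtags
  | \$?\d+(?:[.,]\d+)*%?        # numbers, prices, percentages
  | (?:[A-Z]\.)+                # abbreviations such as U.S.A.
  | \w+(?:[-&']\w+)*            # words, including AT&T and what're
  | [^\w\s]                     # any other single punctuation mark
""", re.VERBOSE)

def rule_based_tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(rule_based_tokenize("AT&T charges $45.45 for #nlp, see http://www.google.com"))
```

Note that a URL followed directly by a period would swallow the period here; real tokenizers need many more special cases.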

A tokenizer can also expand **clitic** contractions marked by apostrophes, e.g. `what're` into `what are`.
A clitic is a part of a word that cannot stand on its own.

Tokenization is tied to **named entity recognition**, the task of detecting names, dates, and organizations.

A common tokenization standard is the **Penn Treebank Tokenization**, where `doesn't` becomes `does n't`.
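
Both ideas, expanding clitics and the Penn Treebank style split of `n't`, can be approximated with `re.sub`. This is a toy sketch: the `CLITICS` table and both helper functions are hypothetical and handle only a handful of contractions.

```python
import re

# small illustrative table of clitics and their expansions (an assumption)
CLITICS = {"'re": " are", "'ll": " will", "'ve": " have"}

def expand_clitics(text):
    for clitic, expansion in CLITICS.items():
        # \b makes sure the clitic ends the word
        text = re.sub(re.escape(clitic) + r"\b", expansion, text)
    return text

def split_negation(text):
    # Penn Treebank style: separate n't from its host word
    return re.sub(r"(\w+)n't\b", r"\1 n't", text)

print(expand_clitics("what're you doing, we'll see"))
print(split_negation("it doesn't matter"))
```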

Word tokenization is more complex in languages like Chinese and Thai, which do not use spaces to mark word boundaries. In Chinese, words are composed of characters called **hanzi**.
Each character generally represents a single unit of meaning (**morpheme**).

For some languages like Thai, a single character is too small a unit, so word segmentation algorithms are needed to group several characters into a word.

**Byte-Pair Encoding: A Bottom Up Tokenization Algorithm**
Instead of defining tokens by rule (a character, whitespace-separated words, or something more complex), we can use the data to tell us what the tokens should be. This is very useful for dealing with unknown words, which are very common in NLP.
In NLP, algorithms learn facts from one corpus (the **training** corpus) and use these facts to make decisions about a separate **test** corpus, hence the problem of unknown words: words in the test corpus that never appeared in training.

To solve this problem, tokenizers induce **subword** tokens.

Most tokenization schemes have 2 parts:
- **token learner**: takes a raw training corpus and induces a vocabulary, a set of tokens.
- **token segmenter**: takes a raw test sentence and segments it into the tokens in the vocabulary.

Two algorithms are widely used:
- **byte pair encoding** (Sennrich et al., 2016)
- **unigram language model** (Kudo, 2018)

The **SentencePiece** library (Kudo and Richardson, 2018a) implements both, but the name SentencePiece is often used to mean **unigram language model** tokenization.


#### BPE
- begin with a vocabulary that is all the individual characters.
- examine training corpus
- choose the two symbols that are most frequently adjacent, e.g. 'A' and 'B'
- add the merged symbol 'AB' to the vocabulary
- replace every adjacent 'A' 'B' with 'AB'
- continue to count and merge, creating longer and longer character strings, until k merges have been done, creating k novel tokens
- k thus becomes the parameter of the algorithm
- the resulting vocabulary consists of the original characters plus k new symbols.

TODO
- implement the BPE algorithm in ocaml
- go through the karpathy implementation of BPE in its video https://youtu.be/zduSFxRajkE


Another tokenization algorithm is WordPiece, which is similar to BPE but uses a different merge strategy based on likelihood. It's used in BERT (RoBERTa, by contrast, uses a byte-level BPE). Slightly more complex than BPE.



#### Word Normalization, Lemmatization and Stemming
The simplest case of word normalization is **case folding**, e.g. mapping everything to lowercase so that `Woodchuck` and `woodchuck` are represented identically, which is very helpful for generalization in tasks like information retrieval or speech recognition.

For sentiment analysis and other text classification tasks, information extraction, and machine translation, by contrast, case can be informative, so case folding is generally not done.

If you use BPE you may not need to do any other normalization.


**lemmatization**
is the task of determining that two words have the same root, e.g. lemmatizing `He is reading detective stories` gives `He be read detective story`

How is lemmatization done?
**morphology** is the study of the way words are built up from smaller meaning-bearing units called **morphemes**.
the **stem** is the central morpheme of the word, supplying the main meaning.
**affixes** add additional meanings of various kinds.
e.g. fox is one morpheme; cats is two: `cat` and `s`

Lemmatization algorithms can be complex, hence we sometimes use a simpler, cruder method called **stemming**, i.e. chopping off word-final affixes.
**Porter Stemming** consists of rewrite rules run in series. It is not commonly used now because of its errors of overgeneralization and undergeneralization.





In [None]:
# very simple Porter Stemming implementation (do not use in production)
def porter_stemming(tokens:list[str]):
  # Rules are tried in order and the first matching suffix wins, so a longer
  # suffix must come before any shorter suffix it ends with ("sses" before "ss" before "s")
  rules = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
    ("ational", "ate"),
    ("izer", "ize"),
    ("ator", "ate"),
    ("al", ""),
    ("able", ""),
    ("ate", ""),
    # there are more complex not covered here
  ]

  stemmed_tokens = []
  for token in tokens:
    stemmed = token
    for suffix, replacement in rules:
      if stemmed.endswith(suffix):
        stemmed = stemmed[:-len(suffix)] + replacement
        break
    stemmed_tokens.append(stemmed)
  return stemmed_tokens


tokens = ["caresses", "ponies", "caress", "cats", "running", "horses", "relational", "digitizer", "operator", "revival", "adjustable", "activate"]
expected = ["caress", "poni", "caress", "cat", "running", "horse", "relate", "digitize", "operate", "reviv", "adjust", "activ"]
output = porter_stemming(tokens)
assert output == expected, f"expected {expected} but got {output}"

In [None]:
# BPE implementation
# `words` is any iterable of strings (single words or longer pieces of text); each string is split into characters
def get_unique_chars(words):
  print(words)
  vocabulary = set()
  for word in words:
    for c in word:
      vocabulary.add(c)
  return vocabulary




# https://github.com/karpathy/minbpe/blob/master/minbpe/base.py#L13
def get_stats(words:list[list[str]]):
  word_counts = defaultdict(int)
  for word in words:
    for pair in zip(word, word[1:]):
      word_counts[pair] += 1
  return word_counts


# merged symbols are only ever added to the vocabulary; the original characters stay in it
# after each merge the corpus has to be rewritten (see update_corpus) so the new symbol can take part in later merges
# `stats` must already be sorted by frequency, most frequent pair first
def merge_most_frequent_tokens_and_get_them(stats, vocabulary):
  tokens, freq = stats[0]
  token1, token2 = tokens
  vocabulary.add(token1+token2)
  return token1, token2, token1+token2

def update_corpus(tokenized_corpus, new_token):
  new_corpus = []
  for word in tokenized_corpus:
    new_word = []
    i = 0
    while i < len(word):
      # with len(word) - 1 we make sure that there are enough tokens in the future to look for
      if i < len(word) - 1 and word[i] + word[i+1] == new_token: # these two tokens basically are forming the new_token
        new_word.append(new_token)
        i += 2 # skip the next token as it's merged
      else:
        new_word.append(word[i])
        i += 1
    new_corpus.append(new_word)
  return new_corpus 


def bpe(corpus, k_merges=10):
  merges = []
  #print(f"corpus={corpus}")
  tokenized_corpus = [list(word) for word in corpus] # each word becomes a list of characters, e.g. "cat" -> ["c", "a", "t"]
  print(f"tokenized_corpus={tokenized_corpus}")
  # the initial vocabulary is the set of individual characters
  vocabulary = get_unique_chars(tokenized_corpus)
  #print(len(vocabulary))
  #print(vocabulary)

  for k in range(k_merges):
    # count the adjacent token pairs
    stats = get_stats(tokenized_corpus)
    if not stats: # every word is already a single token, nothing left to merge
      break
    # sort the stats so that the most frequent pair comes first
    stats = list(sorted(stats.items(), key=lambda item: item[1], reverse=True))
    #print(stats)

    token1, token2, new_token = merge_most_frequent_tokens_and_get_them(stats, vocabulary)
    #print(len(vocabulary))
    #print(vocabulary)
    print(f"most frequent tokens are {token1, token2}")
    merges.append(new_token)

    tokenized_corpus = update_corpus(tokenized_corpus, new_token)
    #print(f"updated_corpus={tokenized_corpus}")

  return vocabulary, merges






In [None]:
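# note: iterating over the `words` dict yields each distinct word once, so the pair counts here ignore word frequencies
vocabulary, merges = bpe(words, 100)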
print(f"vocabulary after bpe = {vocabulary}")
print(f"merges={merges}")

#### Sentence Segmentation
We've learned our vocabulary, so we can now use a token segmenter.

The token segmenter just replays, on the test data, the merges we learned from the training data, greedily and in the order we learned them.

Sentence segmentation usually relies on punctuation, e.g. periods, question marks, and exclamation marks. The period is the most ambiguous, since it also marks abbreviations.
Sentence tokenization works by deciding (with machine learning or rules) whether a period is part of a word or a sentence-boundary marker; an abbreviation dictionary can help identify abbreviations.

In [None]:
def token_segmenter(merged_vocabulary, sentence, k_merges=10):
  # `merged_vocabulary` is the ordered list of merged tokens returned by bpe()
  words = sentence.split(" ")
  tokenized_sentence = [list(word) for word in words]
  print(f"tokenized_sentence = {tokenized_sentence}")
  # replay the first k_merges merges greedily, in the order they were learned
  for new_token in list(merged_vocabulary)[:k_merges]:
    tokenized_sentence = update_corpus(tokenized_sentence, new_token)
  return tokenized_sentence


In [None]:
sentence = "Of course in real settings BPE is run with many Ubaid thousands of merges on a very large input corpus. The result is that most words will be represented as full symbols, and only the very rare words (and unknown words) will have to be represented by their parts."


sentence_tokens = token_segmenter(merges, sentence, 100)
print(sentence_tokens)



##### Minimum Edit Distance
In NLP one common task is to measure how similar two strings are, e.g. in spelling correction, comparing the misspelling graffe with giraffe.
Another example is **coreference**, i.e. deciding whether two strings refer to the same entity.
e.g.
```
Stanford Arizona Cactus Garden.
Stanford University Arizona Cactus Garden.
```
another use of string similarity is measuring the quality of the transcription produced by a speech recognition system: a transcript that differs from the reference by many words is of worse quality than one that differs by only a few.

**Edit distance** gives us a way to quantify these intuitions about similarity.
**Minimum edit distance** is defined as the minimum number of editing operations (insertions, deletions, substitutions) needed to transform one string into another.
We can also assign a cost to each of these operations when computing an alignment.
The **Levenshtein** distance between two sequences is the simplest weighting, in which each of these three operations has a cost of 1.
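
Minimum edit distance is usually computed with dynamic programming: a table whose cell `[i][j]` holds the cheapest way to turn the first `i` characters of the source into the first `j` characters of the target. The sketch below is one minimal way to write it; the function name and the cost parameters are choices made for this example.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    n, m = len(source), len(target)
    # distance[i][j] = cost of turning source[:i] into target[:j]
    distance = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):  # delete every character of the source prefix
        distance[i][0] = distance[i - 1][0] + del_cost
    for j in range(1, m + 1):  # insert every character of the target prefix
        distance[0][j] = distance[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            distance[i][j] = min(
                distance[i - 1][j] + del_cost,                           # deletion
                distance[i][j - 1] + ins_cost,                           # insertion
                distance[i - 1][j - 1] + (0 if same else sub_cost),      # substitution or copy
            )
    return distance[n][m]

print(min_edit_distance("graffe", "giraffe"))
print(min_edit_distance("intention", "execution"))
print(min_edit_distance("intention", "execution", sub_cost=2))
```

Setting `sub_cost=2` gives the variant in which a substitution counts as a deletion plus an insertion.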

We could do case folding, but for the purpose of this series we don't need to; we keep the text as it is.

We'll now extract the text and put it into a file, which becomes our corpus for the next lectures; alternatively we could use the Shakespeare corpus.

In [None]:
def extract_corpus_from_messages():
  messages_dataset_filepath = "../datasets/train.csv"
  corpus_filepath = "../datasets/messages_corpus.txt"

  with open(messages_dataset_filepath, "r") as messages_file:
    lines = messages_file.readlines()

    with open(corpus_filepath, "w", newline="") as corpus_file:
      for line in lines[1:]:
        text = line[2:]
        print(text)
        corpus_file.write(text)
        



extract_corpus_from_messages()

what is the **datasheet** of our application blokedin? include it here and assign it as an exercise.

TODO:
- implement better more real world BPE (ocaml) and reference it in this lecture



##### Summary
we covered
- regular expression: a powerful tool for pattern matching
- **concatenation** of symbols, **disjunction** (`[]`, `|`), **counters** (`*`, `+`, `{n,m}`), **anchors** (`^`, `$`) and precedence operators (`(`, `)`).
- **word tokenization and normalization**
- the **Porter** stemmer, a simple algorithm for stemming
- **minimum edit distance**, computed with **dynamic programming**, which also gives an **alignment** of two strings

### TAD Course Objectives

mark each of these as accomplished once it is done.


- **Introduction to text**: Tokenization, vector distribution, cosine similarity, and lemmatization
- **Distributions and clustering**: Term distribution, TF-IDF, unsupervised text clustering
- **Language modeling**: Basic probability, language models as text, smoothing, probabilistic document similarity
- **Word embeddings**: Dense text representations and word embeddings
- **Text classification**: Classification and regression, naive bayes, support vector machines, & information theory
- **Natural Language Processing (NLP)**: Sequence tagging and structured parsing
- **Evaluation**: Metrics, cross-fold validation, and best practices in model design
- **Advanced clustering**: LSI and visualization
- **Applications I**: information extraction and question answering
- **Applications II**: Dialogue systems and chatbots

##### Tech
- NLTK, Spacy: for text analysis
- Scikit (examples tab, user guide tab): for machine learning
- pandas, numpy, scipy: for data analysis
- gensim: for topic and word modeling
- TODO add anything else that will be used here




### Quiz 

> Tokenization is simply the process of splitting text into words on space characters?
- True
- False

> A one hot encoding records the frequency of each word in a piece of text?
- True
- False

> We must store the offset of words in the vectors using a dictionary in order to implement one-hot encoding?
- True
- False

> Why is stemming useful?
1. it allows matching of words that sound the same
2. it allows matching of words with morphological variations
3. it allows matching of misspelled words

> A vocabulary in a one hot encoding typically includes:
1. All unique tokens in the text collection
2. All normalized unique tokens in the text collection
3. All normalized unique tokens with their counts
4. All unique tokens with their lemmas


