# Lab 1

The aims of the lab are to:
- [x] implement a simple tokenizer (just consecutive characters that form words) and compare it to the NLTK Treebank word tokenizer
- [ ] extract the counts of the vocabulary and the tokenizer
- [x] implement a simple Porter stemmer and compare it to the NLTK one to see if there are differences
- [ ] implement the BPE tokenizer
- [ ] ...

In [None]:
!pip install nltk

In [None]:
from nltk.tokenize import RegexpTokenizer
import pandas as pd

## Load and process Linkedin messages data

This is a set of messages received via Linkedin, collected manually into a json structure in a python file and converted into a csv file by the script at `scripts/process_linkedin_messages.py`.

The csv has four columns:
- `block`: a binary `0` or `1` indicating whether we should block the sender or not
- `content`: the content of the message received
- `subject`: the subject of the message received
- `has_attachment`: a binary `0` or `1` indicating whether the message received contained an attachment


To follow along you could recreate a similar structure and run it through `scripts/process_linkedin_messages.py` to create your own dataset in csv format. We'll load that csv file into a Pandas dataframe and use it for the other NLP tasks.
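
The processing script itself is not shown in this lab. As a rough sketch of what such a step might look like, the block below writes two made-up message records to CSV with the four columns described above. The record contents, the `records_to_csv` name and the in-memory buffer are all assumptions for illustration, not the actual `scripts/process_linkedin_messages.py`.

```python
import csv
import io

# Hypothetical records; the real data is collected by hand.
records = [
    {"block": 1, "content": "Buy followers now!", "subject": "", "has_attachment": 0},
    {"block": 0, "content": "Hi, are you open to a chat?", "subject": "Quick question", "has_attachment": 1},
]

COLUMNS = ["block", "content", "subject", "has_attachment"]


def records_to_csv(rows, out):
    """Write the message records to `out` (any file-like object) as CSV."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
records_to_csv(records, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

An empty subject is written as an empty field, which `pd.read_csv` loads as `NaN` by default.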




In [None]:
def load_corpus(corpus_filepath):
  dataset = pd.read_csv(corpus_filepath)
  return dataset

In [None]:
messages_corpus_filepath = "../datasets/messages.csv"

messages_dataset = load_corpus(messages_corpus_filepath)

print(f"total messages in the dataset: {len(messages_dataset)}")



Now that we've loaded our messages dataset, let's analyze the data.

#### Your task: 
Use the head() function to print out the top few samples of the Linkedin messages.



In [None]:
messages_dataset.head()

We can see that some of the messages have no subject; pandas represents these missing values as `NaN` (Not a Number) in the cell.

TODO: in the future we'll need to deal with these in some way when using this column as a feature in our model.
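
As a possible starting point for that TODO, here is a minimal sketch of a few common ways to handle a missing subject. The `toy` frame and its values are made up; which option is appropriate for the model is still an open choice.

```python
import pandas as pd

# Toy frame standing in for messages_dataset; the values are made up.
toy = pd.DataFrame({
    "subject": ["Job offer", None, "Hello"],
    "content": ["a", "b", "c"],
})

missing = toy["subject"].isna().sum()              # number of rows without a subject
has_subject = toy["subject"].notna().astype(int)   # binary feature: 1 if a subject exists
filled = toy["subject"].fillna("")                 # empty string instead of NaN

print(missing, has_subject.tolist(), filled.tolist())
```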



Let's now print a statistical summary of the data using `describe()`

In [None]:
messages_dataset.describe(include="all").transpose()

What do we see? Do most messages contain the subject? Are all the subjects unique? Do most of the messages contain an attachment?

### Iteration

To iterate through a DataFrame's rows in pandas one can use:

- DataFrame.iterrows() or DataFrame.itertuples()

*Question* Which is faster? 
 - We will time them using timeit
 - We will demonstrate how to use iteration with a list comprehension.
 
#### Your Task
Execute the code below and read it to understand how it works. In particular, notice the selection of fields from the DataFrame

In [None]:
# Note the use of itertools which has many useful iteration utilities.
from itertools import islice

N = 5

# print the first N rows using both methods
# iterrows
for index, row in islice(messages_dataset.iterrows(), N):
  print(index, row["content"], row["block"])

# itertuples
print(f"now itertuples")
for row in islice(messages_dataset.itertuples(index=True, name="Pandas"), N):
  print(getattr(row, "content"), getattr(row, "block"))


N = len(messages_dataset) # I don't have a bigger dataset, we could use the imdb in case
# iterrows; the -o flag makes %timeit return its result instead of None
time1 = %timeit -o [row["content"] for index, row in islice(messages_dataset.iterrows(), N)]

# itertuples
time2 = %timeit -o [getattr(row, "content") for row in islice(messages_dataset.itertuples(index=True, name="Pandas"), N)]

We can see that `itertuples` is faster by some margin.
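
Outside a notebook the `%timeit` magic is not available. The comparison can be sketched with the standard `timeit` module instead; the synthetic `frame` below is an assumption standing in for the messages dataset, and the absolute timings will vary by machine.

```python
import timeit

import pandas as pd

# Synthetic frame; the column names mirror the lab's dataset but the rows are made up.
frame = pd.DataFrame({
    "content": [f"message {i}" for i in range(1000)],
    "block": [i % 2 for i in range(1000)],
})


def with_iterrows():
    # iterrows builds a Series object for every row
    return [row["content"] for _, row in frame.iterrows()]


def with_itertuples():
    # itertuples yields lightweight namedtuples
    return [row.content for row in frame.itertuples(index=False)]


t_rows = timeit.timeit(with_iterrows, number=5)
t_tuples = timeit.timeit(with_itertuples, number=5)
print(f"iterrows: {t_rows:.4f}s, itertuples: {t_tuples:.4f}s")
```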


We can index into the dataframe, by asking for particular rows:
- `iloc` allows us to ask for particular row(s) indexed by (integer) position
- `loc` allows us to ask for particular row(s) indexed by a label 

By default labels are assigned automatically, starting from 0, so the two statements are identical for us.

Each row we get back from the dataframe is a Series, a one-dimensional labelled array that can hold any data type.
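
The difference between position and label only shows up once the labels are not `0, 1, 2, ...`. The sketch below uses two made-up toy frames to illustrate this; none of the values come from the messages dataset.

```python
import pandas as pd

# Toy frame with string labels so that position and label differ.
toy = pd.DataFrame({"content": ["first", "second", "third"]},
                   index=["a", "b", "c"])

by_position = toy.iloc[0]   # first row, regardless of its label
by_label = toy.loc["c"]     # row whose index label is "c"

# After filtering, the original labels are kept, so position 0 no longer has label 0.
numbers = pd.DataFrame({"content": ["x", "y", "z"]})   # default labels 0, 1, 2
subset = numbers[numbers["content"] != "x"]
print(subset.iloc[0]["content"], subset.loc[1]["content"])
```

Both lookups on `subset` return the same row, because the row at position 0 still carries the label 1.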



In [None]:
print(type(messages_dataset))

In [None]:
print(messages_dataset.iloc[0])

In [None]:
print(messages_dataset.loc[1])

In [None]:
print(type(messages_dataset.iloc[0]))

## Collection statistics

Let's learn more about the Linkedin Messages collection statistics. 

#### Your Task
Use count() on the `messages_dataset` object to print a count distribution. 
- Note: count() gives count values for each column in the frame independently

In [None]:
messages_dataset.count()

We should have around 103 messages (depends on how many you collected).
- Do all columns have the same count? What does this tell us about the collection?

Now, let's dig and explore more statistics on the messages by their text content.

First, let's select all the values of the `content` column and inspect its type.


In [None]:
messages_contents = messages_dataset["content"]
print(type(messages_contents))
messages_contents.head()

It's a Series: selecting a single column of a DataFrame returns a Series. We will use a useful function on the Series object, `value_counts`, to group and count the values of the `content` series.

#### Your task:
Use value_counts on the `messages_contents` variable.
 - Print a statistical summary of the data using .describe() 
 - Use head() to print out the top 5 messages with their count


In [None]:
messages_counts = messages_contents.value_counts()
messages_counts.head()

^ Since all our messages are unique we'll see the count of 1 for each message.

- What information is provided by the describe() function?
- What does the statistical summary tell you about the frequency distribution of Linkedin Messages?

TODO~~Consider what was discussed in lecture 2 about word distributions; this distribution follows a similar pattern. This is typical of real-world text data and will have important ramifications later in the course.~~

In [None]:
messages_counts.describe()

Let's now get some more interesting statistics using the value_counts function.

What is the percentage of the messages that were blocked and the ones not blocked?

In [None]:
blocked_messages = messages_dataset["block"]
blocked_messages_counts = blocked_messages.value_counts()
blocked_messages_counts


In [None]:
blocked_messages_counts.describe().transpose()

TODO confirm these questions make sense
- What does the `blocked_messages_counts` variable represent?
- Critically look at these statistics
- ~~What is the shortest message, longest message~~
- ~~How many posts are in a 'typical' message~~
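
The share of blocked versus non-blocked messages can also be computed in one step with `value_counts(normalize=True)`. This is a minimal sketch on made-up labels; the `block` Series below is an assumption standing in for `messages_dataset["block"]`.

```python
import pandas as pd

# Made-up labels standing in for messages_dataset["block"].
block = pd.Series([0, 1, 1, 0, 1], name="block")

shares = block.value_counts(normalize=True)   # fractions that sum to 1
percentages = (shares * 100).round(1)

print(percentages)
```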

In [None]:
non_blocked_messages_percentage = blocked_messages_counts[0] / blocked_messages_counts.sum()

# the value is a fraction between 0 and 1, so format it as a percentage
print(f"percentage of non blocked messages in the messages dataset: {non_blocked_messages_percentage:.1%}")

In [None]:
blocked_messages_percentage = blocked_messages_counts[1] / blocked_messages_counts.sum()
print(f"percentage of blocked messages in the messages dataset: {blocked_messages_percentage:.1%}")

## Text processing of Linkedin messages

We will now take our first steps processing the Linkedin messages as text. We will apply an NLTK text processing pipeline to the messages.

### NLTK

[NLTK](http://www.nltk.org/) is a large compilation of Python NLP packages. It includes implementations of a number of classic NLP models, as well as utilities for working with linguistic data structures, processing text, and managing corpora.

Later we may use spaCy.

Let's import the library.

In [None]:
import nltk

### Step 1: Tokenization
We will try tokenization ourselves using NLTK's tokenizers. You may find the documentation of the [tokenize package](https://www.nltk.org/api/nltk.tokenize.html) informative.


#### Your task

Tokenize the `content` field of a message using the Regular Expression and Treebank tokenizers (recall the latter as the standard tokenizer from the Penn Treebank discussed in lecture 1) and compare them:

1. Import the Reg Exp Tokenizer from NLTK.
2. Create a regular expression tokenizer that uses a simple pattern -- a sequence of one or more "word characters".
3. Create a tokenizer that tokenizes using the (Penn) Treebank Word Tokenizer. Find the right tokenizer in the package documentation.
4. Tokenize the sample message below using each of the tokenizers and print the resulting tokens for each.
5. Inspect and compare the output of the tokenizers.


In [None]:
# a sample message to tokenize
text = messages_dataset.iloc[69]['content']
print(text)

In [None]:
# nltk tokenizer
def nltk_tokenizer(text, regex=None):
  regex = regex or r"\w+"
  return RegexpTokenizer(regex).tokenize(text)

In [None]:
import re

def my_tokenizer(text):
  pattern = r"\w+"
  return re.findall(pattern, text)

In [None]:
from nltk.tokenize.treebank import TreebankWordTokenizer

def treebank_word_tokenizer(text):
  tokenizer = TreebankWordTokenizer()
  tokens = tokenizer.tokenize(text)
  return tokens

In [None]:
nltk_tokens = nltk_tokenizer(text)
my_tokens = my_tokenizer(text)
treebank_tokens = treebank_word_tokenizer(text)

In [None]:
print(text)

In [None]:
print(nltk_tokens)

In [None]:
print(my_tokens)

In [None]:
print(treebank_tokens)

### Tokenization matters
Tokenization is a critical first choice in developing text applications.  Below are some questions to consider when comparing the tokenizers.

- What are the key differences between the tokenizers?
- How do they treat punctuation?
- What happens to the link + URL?
- What is a 'good' vs a 'bad' tokenizer?  
- How would you critically select one to use in an application? 

Is either of these tokenizers perfect? Consider how you would change the tokenizer to make it effective for the Linkedin Messages data.
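
To make the questions above concrete, here is a small sketch that runs both tokenizers on one made-up sentence containing a contraction, punctuation and a link. The sentence and the variable names are assumptions for illustration only.

```python
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize.treebank import TreebankWordTokenizer

# Made-up sentence with a contraction, punctuation and a link.
sample = "Hi, don't miss this: https://example.com/jobs?id=1"

word_char_tokens = RegexpTokenizer(r"\w+").tokenize(sample)
treebank_sample_tokens = TreebankWordTokenizer().tokenize(sample)

print(word_char_tokens)
print(treebank_sample_tokens)
```

The `\w+` pattern drops all punctuation and breaks both the contraction and the link into fragments, while the Treebank tokenizer keeps punctuation as separate tokens and splits "don't" into "do" and "n't".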



- What difference do we see between the simple tokenizer and the treebank word tokenizer?

- What are the consequences of these differences?

- When do we use the simple tokenizer and when the treebank word tokenizer?

### BPE Tokenizer

Now, let's also introduce the BPE tokenizer implemented in lecture 1 and apply it to the `text`. We'd need to re-train it on the entire messages corpus.

TODO
- In our Linkedin messages let's replace the links with the text `[URL]` using regex substitutions. We could even replace it with LINK and see what is the difference.
- eventually compare it with sentencepiece and tiktoken from openai just to see what modern tokenizers do.



#### BPE algorithm

TODO: should BPE run on words rather than on chars? Reread the first chapter and follow the lab and the lecture to see where the differences are.

- begin with a vocabulary that is all the individual characters
- examine the training corpus, i.e. count how often each pair of adjacent symbols occurs
- choose the two symbols that are most frequently adjacent, e.g. 'A' and 'B'
- add the merged symbol 'AB' to the vocabulary and replace every adjacent 'A' 'B' in the corpus with 'AB'
- continue to count and merge, creating longer and longer character strings, until *k* merges have been done, creating *k* novel tokens
- *k* thus becomes the parameter of the algorithm
- the resulting vocabulary consists of the original characters plus *k* new symbols
- when tokenizing, the newly created tokens are the ones that should be matched first, longest first.
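
The steps above can be sketched on a toy string. This is a simplified illustration under stated assumptions: it treats the corpus as one flat sequence of characters (no splitting into words first), and the names `merge_pair` and `toy_bpe` are made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def toy_bpe(text, k):
    symbols = list(text)            # start from individual characters
    vocabulary = set(symbols)
    merges = []
    for _ in range(k):
        pair_counts = Counter(zip(symbols, symbols[1:]))
        if not pair_counts:
            break                   # nothing left to merge
        pair, _count = pair_counts.most_common(1)[0]
        symbols = merge_pair(symbols, pair)
        vocabulary.add(pair[0] + pair[1])
        merges.append(pair)
    return symbols, vocabulary, merges


symbols, vocabulary, merges = toy_bpe("aaabdaaabac", k=3)
print(merges[0])   # ('a', 'a'), the most frequent adjacent pair
print(symbols)
```

Joining `symbols` back together always reproduces the input, since merging only regroups characters.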


In [None]:
from collections import defaultdict

# rows is a pandas Series of message strings
def extract_corpus_text_from_pd_rows(rows) -> str:
  corpus_text = []
  for row in rows:
    corpus_text.append(str(row))
  output = "".join(corpus_text)
  assert isinstance(output, str), f"output is not of type str, got {type(output)}"
  return output


def create_initial_vocabulary(corpus_text: str):
  # the initial vocabulary is the set of individual characters in the corpus
  unique_chars = set(corpus_text)

  # sanity checks: we expect every ASCII letter and digit to occur in the corpus
  alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
  for c in alphabet:
    if c not in unique_chars:
      raise ValueError(f"character {c} not in the unique vocabulary")

  for n in range(10):
    if str(n) not in unique_chars:
      raise ValueError(f"number {n} not in the unique vocabulary")
  return unique_chars


def get_most_frequent_tokens(tokenized_corpus_text):
  # count every pair of adjacent symbols in the tokenized corpus
  # (simplification: the corpus is one flat list, so pairs may span whitespace)
  counts = defaultdict(int)
  for pair in zip(tokenized_corpus_text, tokenized_corpus_text[1:]):
    counts[pair] += 1

  sorted_counts = sorted(counts.items(), key=lambda item: item[1], reverse=True)
  # ((left_symbol, right_symbol), count) of the most frequent adjacent pair
  return sorted_counts[0]


# training_corpus must be tokenized (a list of symbols), otherwise this won't work
def update_training_corpus(training_corpus, pair):
  merged_token = pair[0] + pair[1]
  updated_training_corpus = []
  i = 0
  # a while loop lets us skip the second half of a merged pair
  while i < len(training_corpus):
    if i < len(training_corpus) - 1 and (training_corpus[i], training_corpus[i+1]) == pair:
      updated_training_corpus.append(merged_token)
      i += 2
    else:
      updated_training_corpus.append(training_corpus[i])
      i += 1
  return updated_training_corpus


def add_new_token_to_vocabulary(V, new_token):
  if new_token not in V: V.add(new_token)


messages_corpus_text = extract_corpus_text_from_pd_rows(messages_dataset["content"])
print(messages_corpus_text)
print(f"len of the corpus is {len(messages_corpus_text)}")
V = create_initial_vocabulary(messages_corpus_text)


def bpe(corpus_text, k_merges=1):
  merges = []
  # start from individual characters
  tokenized_corpus = list(corpus_text)
  vocabulary = create_initial_vocabulary(corpus_text)

  for _ in range(k_merges):
    pair, count = get_most_frequent_tokens(tokenized_corpus)
    new_token = pair[0] + pair[1]
    print(f"new_token={new_token!r}, len(new_token)={len(new_token)}, count={count}")

    merges.append(pair)
    add_new_token_to_vocabulary(vocabulary, new_token)
    tokenized_corpus = update_training_corpus(tokenized_corpus, pair)

  return vocabulary, merges, tokenized_corpus


vocabulary, merges, tokenized_corpus = bpe(messages_corpus_text, 1)
print(merges)


Below is the code to replace any link occurrence with the string `[URL]` (or `URL`, `LINK` or `[LINK]`, depending on which one you prefer).
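
In isolation, the link substitution can be sketched as a small helper. The pattern is the one used in this lab; the `replace_links` name, the `placeholder` parameter and the example sentences are assumptions for illustration.

```python
import re

# A scheme (http, https, ftp) or "www." followed by any run of non-space characters.
LINK_PATTERN = re.compile(r"\b(?:https?://|ftp://|www\.)\S+")


def replace_links(message, placeholder="[URL]"):
    """Return `message` with every link replaced by `placeholder`."""
    return LINK_PATTERN.sub(placeholder, message)


print(replace_links("See https://example.com/jobs for details"))
print(replace_links("Visit www.example.com.", placeholder="LINK"))
```

Because `\S+` is greedy, punctuation directly attached to a link (such as the final period in the second example) is swallowed by the replacement.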

In [None]:
import nltk

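# nltk tokenizer with fixed regular expression
# ` ?\[URL\]` must come before the punctuation alternative, otherwise that alternative
# consumes the `[` and the placeholder is split into several tokens
def nltk_tokenizer(text):
  regex_pattern = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\[URL\]| ?\d+| ?[^\sA-Za-z\d]+|\S+"""
  return RegexpTokenizer(regex_pattern).tokenize(text)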

messages_corpus_filepath = "../datasets/messages.csv"
messages_dataset = load_corpus(messages_corpus_filepath)

print(f"messages_dataset={len(messages_dataset)}")

messages_tokens = [] # use this to create another dataframe with the tokens? or just the text actually as one string
texts = [] # raw message strings; named `texts` so the sample message in `text` is not overwritten
list_messages = []
for row in messages_dataset.itertuples(index=True, name="Pandas"):
  message = getattr(row, "content")
  texts.append(message)
  # sub the links with the string (hopefully a token when tokenizing) `[URL]`
  message = re.sub(r'\b(?:https?://|ftp://|www\.)\S+', '[URL]', message)
  list_messages.append((message))

  message_tokens_nltk = nltk_tokenizer(message)

  messages_tokens.append(message_tokens_nltk) # array of arrays of string tokens

messages_frame = pd.DataFrame(list_messages, columns=["content"]) # why are we doing this? what do we need it for?
# ok we need it for the below, can we not use anything else? like messages_dataset["content"]
all_messages_tokenized_with_apply = messages_frame.content.apply(nltk_tokenizer)


#
print(type(all_messages_tokenized_with_apply))
print(type(messages_tokens))

print(len(all_messages_tokenized_with_apply), len(messages_tokens))

# check that the dataframe.apply tokenization works the same as the one done through iteration
for i in range(len(messages_tokens)):
  #print(messages_tokens[i])
  #print(all_messages_tokenized_with_apply.iloc[i])
  len_messages_tokens = len(messages_tokens[i])
  len_all_tokenized_with_apply = len(all_messages_tokenized_with_apply.iloc[i])
  if len_messages_tokens != len_all_tokenized_with_apply:
    print("mismatch", i, len(messages_tokens[i]), len(all_messages_tokenized_with_apply.iloc[i]))





In [None]:
messages_frame

Now let's create a vocabulary from the tokens created previously. The vocabulary is just the set of unique tokens; here we store it in a python dictionary that maps each token to its count.

We already get a vocabulary when creating the BPE tokens, but for the other tokenization mechanisms we need to build the vocabulary explicitly.
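
As a minimal sketch of the idea, the block below builds a vocabulary with counts from a couple of made-up tokenized messages. The `tokenized_messages` list and its contents are assumptions for illustration.

```python
from collections import Counter

# Made-up tokenized messages (a list of token lists).
tokenized_messages = [
    ["hello", "there", "hello"],
    ["hello", "[URL]"],
]

# Count every token across all messages.
token_counts = Counter(token for tokens in tokenized_messages for token in tokens)
vocabulary = sorted(token_counts)     # the unique tokens (types)

print(len(vocabulary), sum(token_counts.values()))
print(token_counts.most_common(1))
```

The number of keys is the vocabulary size, while the sum of the counts is the total number of tokens.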



In [None]:
import itertools

# A single variable with the (flattened) tokens from all messages
# 14363 tokens in total in all messages??? how many unique ones??
flat_tokens_with_apply = list(itertools.chain.from_iterable(all_messages_tokenized_with_apply))
print(flat_tokens_with_apply)
print(len(flat_tokens_with_apply))





In [None]:
from collections import Counter

messages_counter_with_apply = Counter(flat_tokens_with_apply)

print(f"these are the unique tokens in the vocabulary? {len(messages_counter_with_apply)}")
print(messages_counter_with_apply)



In [None]:
# I think we can remove these
messages_counter_with_apply.most_common(25)

In [None]:
from collections import defaultdict
V = defaultdict(int)
# let's create the vocabulary
for tokens in messages_tokens:
  for token in tokens:
    V[token] += 1


In [None]:
print(V)
print(f"number of tokens = {len(V)}")

Why does the Counter have a different number of elements from the `V` vocabulary??

What do each of these represent?

In [None]:
V = sorted(V.items(), key=lambda item: item[1], reverse=True)
V

In [None]:
print(f"len messages_counter_with_apply {len(messages_counter_with_apply)}")
print(f"len(V)={len(V)}")
print(f"len(messages_counter_with_apply) - len(V)={len(messages_counter_with_apply) - len(V)}")
# what are these 25 diff?


messages_counter_list = messages_counter_with_apply.most_common(len(messages_counter_with_apply))

# print(type(V)) # V is a list of tuples

# collect the (token, frequency) pairs of the Counter that do not appear in V
not_in_V = []
for token, frequency in messages_counter_with_apply.items():
  if (token, frequency) not in V:
    #print(f"this pair is not in V: {token, frequency}")
    not_in_V.append((token, frequency))


"""
So messages_counter has links that V doesn't have:
not_in_V
"""


In [None]:
print(not_in_V)

In [None]:
print(texts)
print(len(texts))

In [None]:
one_big_text = "".join(texts)
print(one_big_text)
print(f"these many chars in the messages corpus: {len(one_big_text)}")

Now we have the messages text in one big corpus. Let's use the BPE tokenization and see what tokens we get.

Then use the openai tiktoken to get its tokens and eventually compare all the tokenization mechanisms.

Now, let's analyze the texts and see how many tokens we have in total.

Let's use different types of tokenizers to get the tokens and compare which one is a better fit for our task.

Let's also use the Porter Stemmer to stem our text, and also lowercase everything.

TODO what is the percentage of the blocked messages and not blocked messages?

In [None]:
# this is a pandas Series type
type(messages_contents)

In [None]:
messages_counts = messages_contents.value_counts()
print(messages_counts.head())
messages_counts.describe()

#### Your Task

- Print out the 50 most frequent (common) tokens in the messages collection with their term frequencies (TF). 
  
Use the python [collections.Counter](https://docs.python.org/2/library/collections.html) class. See its documentation for examples on how to use it.

In [None]:
from collections import Counter

# flat_tokens_with_apply holds the flattened tokens from all messages
Counter(flat_tokens_with_apply).most_common(50)


### Step 2: Text Normalization
In this section we will apply simple text normalization. We will write a function that takes raw tokens and normalizes them.

Define a python function called `normalize` that:
- Takes a sequence of *tokens* as input 
- Returns a list of *normalized tokens*
- The function should perform the following normalization: lowercasing (basic String operation) and stem the tokens using the PorterStemmer (see also the [NLTK stem package](https://www.nltk.org/api/nltk.stem.html))
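
To get a feel for what this normalization does, here is a small sketch on a few made-up tokens. The token list and the `porter` variable name are assumptions for illustration.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Made-up tokens; lowercasing happens before stemming.
sample_tokens = ["Connections", "connected", "Running", "messages"]
stems = [porter.stem(token.lower()) for token in sample_tokens]

print(list(zip(sample_tokens, stems)))
```

Different surface forms collapse onto the same stem ("connect"), and a stem does not have to be a dictionary word ("messag").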



In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def normalize(tokens):
  normalized_tokens = [stemmer.stem(token.lower()) for token in tokens]
  return normalized_tokens

##### Apply the normalize function to the flat tokens. 
This may take 1-2 minutes to run over the entire collection (it is almost 5 million tokens)

TODO update the above to reflect the actual tokens in our dataset.

In [None]:
# the flat tokens are the flattened tokens from all messages, created earlier
flat_tokens = flat_tokens_with_apply
normalized_tokens = normalize(flat_tokens)
print(flat_tokens[:200])
print(normalized_tokens[:200])

### Collect information on the vocabulary

In [None]:
# Set of unique tokens (from flat_tokens)
B = set(flat_tokens)
print(f"set of unique tokens (from flat_tokens) {B}")

# Set of unique normalized tokens --> the vocabulary
V = set(normalized_tokens)

# |N| - number of all tokens
N = len(flat_tokens)
print(N)

# |B|
print(len(B))

# |V| 
print(len(V)) 

**Stopwords**

The most common words are function words, often referred to as 'stop words' because they don't convey meaningful information for 'aboutness' as discussed in the lecture. In many applications we remove stopwords to remove 'noise', but in other cases they may be important to keep. You should be able to justify your decisions for when (and what) words are 'stop words'.

For example, in language identification stopwords are important because they are usually not shared across languages. We can therefore identify the language of a text based on its expected usage patterns of these words.
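
A minimal sketch of stopword removal, assuming a tiny hand-picked stop list. The `STOP_WORDS` set and the sample sentence are made up for illustration; a real application would choose its list deliberately.

```python
from collections import Counter

# A tiny hand-picked stop list, for illustration only.
STOP_WORDS = {"the", "a", "to", "and", "i", "you", "of", "in"}

sample_tokens = "i would like to connect with you and talk about the role".split()

# Keep only the tokens that are not in the stop list.
content_tokens = [t for t in sample_tokens if t not in STOP_WORDS]

print(Counter(content_tokens).most_common(3))
print(f"removed {len(sample_tokens) - len(content_tokens)} of {len(sample_tokens)} tokens")
```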



TODO this shouldn't be here
exercise to the reader:
- implement the treebank word tokenizer using regular expressions

### Corpora Datasheet

For completeness with lecture 1, we'll describe the Linkedin messages corpus with its corpora datasheet.

- motivation for collecting the corpus
- situation: when and in what situation was text written/spoken
- language variety: what language was the corpus in?
- speaker demographics: what was e.g the age, sex of the text's authors?
- collection process: how big is the data? if it is a subsample how was it sampled? was the data collected with consent? how was the data preprocessed? and what metadata is available?
- annotation process: what are the annotations, how was the data annotated? how was the annotation process?
- distribution: are there copyright or other intellectual property restrictions?

Here are the answers:

- motivation: just for fun, turned into something more serious
- situation: messages received by people on linkedin
- language variety: english
- speaker demographic: different age, both sexes
- collection process: received on linkedin, copied and pasted into a json file and put into a csv by the script in `scripts/process_linkedin_messages.py`
- annotation process: annotation was manual and judged by me
- distribution: no copyright


TODO
- List things covered in the lecture but not done in the lab