![bse_logo_textminingcourse](https://bse.eu/sites/default/files/styles/bgse_image_highlighted_image/public/carrousel/BSE_Summer_School.jpg?itok=rz2Sx5Ok)

# Harnessing Language Models: Your Path to NLP Expert - Warmup

# Introduction to Natural Language Processing (NLP)

This notebook introduces you to pre-processing text. Pre-processing is a very important part of text analysis when using the bag-of-words model, and we will spend some time doing things by "hand" that we later use packages for. We use a corpus of just a few sentences, where each sentence is treated as a document, to illustrate the idea of the document-term matrix that is central to the bag-of-words model.

We will do this to understand why we are pre-processing in a particular way and how pre-processing can be used to explore the corpus.

We will go through all steps:
- **Tokenization**

- **Normalization**
  - Lemmatizing
  - Removing punctuation
  - Unifying text (e.g., converting to lower case)
  - Removing stopwords
  - Stemming

- **Counting (N-grams) -> Document-Term Matrix (Vectorization)**

This is not the order you typically see on webpages about pre-processing, but I think it is the order in which you actually approach things chronologically. Lemmatizing is the odd one out in this order. It is also the most complex step and the least well-packaged, so it is worth spending a little time on.

We will ignore N-grams today when making counts of unigrams just to keep things as simple as possible.
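For reference, an N-gram is simply a run of N consecutive tokens. The following is a minimal sketch of how they could be generated from a token list; the helper name `make_ngrams` is made up for this illustration and is not used elsewhere in the notebook.

```python
def make_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["joe", "biden", "the", "president"]
unigrams = make_ngrams(tokens, 1)  # one tuple per token
bigrams = make_ngrams(tokens, 2)   # every pair of neighbouring tokens
print(unigrams)
print(bigrams)
```

With n = 1 this reduces to the plain token list, which is the case we stick to today.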

Important: later parts of the course will ONLY share the tokenization step in this pipeline and the tokenizer used will be very different. This is because modern NLP does not rely on the bag-of-words model. However, bag-of-words is still often used in actual NLP applications and so it is useful to know.

Before we start we need to install a few things.

In [1]:
!pip install pandas
!pip install nltk
!pip install spacy
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 29.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [2]:

import nltk
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Tokenizing

Tokenization is the process of breaking a string of text into smaller pieces, called tokens. In text mining and natural language processing (NLP), these tokens are typically words, phrases, symbols, or other meaningful elements.
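To make the idea concrete before we turn to nltk, here is a small sketch that contrasts splitting on whitespace with a simple regular-expression tokenizer. The function `simple_tokenize` and its pattern are assumptions made for this illustration; this is not how `nltk.word_tokenize` works internally.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+(?:-\w+)* keeps hyphenated words such as "well-known" together;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

text = "A really well-known artist (US)."
whitespace_tokens = text.split()
regex_tokens = simple_tokenize(text)
print(whitespace_tokens)  # punctuation stays glued to the words
print(regex_tokens)       # punctuation becomes separate tokens
```

The point is that "what counts as a token" is a design decision: the two approaches disagree on `(US).`.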

In [3]:
import pandas as pd

import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

import spacy
sp = spacy.load('en_core_web_sm')

import re

from nltk.corpus import stopwords


In [4]:
# Define some example sentences
sentence1 = "The president of the United States (US) is Joe Biden."
sentence2 = "Joe Biden the president wants us in a united country."
sentence3 = "Did a known artist paint portraits of Joe Biden?"
sentence4 = "A really well-known portrait artist is Vincent van Gogh."
sentence5 = "The leaves were left untouched, and the bats were unable to bat away the flies."

# Tokenize the sentences
tokens1 = nltk.word_tokenize(sentence1)
tokens2 = nltk.word_tokenize(sentence2)
tokens3 = nltk.word_tokenize(sentence3)
tokens4 = nltk.word_tokenize(sentence4)
tokens5 = nltk.word_tokenize(sentence5)

# Print the tokens of each sentence
print(tokens1)
print(tokens2)
print(tokens3)
print(tokens4)
print(tokens5)

['The', 'president', 'of', 'the', 'United', 'States', '(', 'US', ')', 'is', 'Joe', 'Biden', '.']
['Joe', 'Biden', 'the', 'president', 'wants', 'us', 'in', 'a', 'united', 'country', '.']
['Did', 'a', 'known', 'artist', 'paint', 'portraits', 'of', 'Joe', 'Biden', '?']
['A', 'really', 'well-known', 'portrait', 'artist', 'is', 'Vincent', 'van', 'Gogh', '.']
['The', 'leaves', 'were', 'left', 'untouched', ',', 'and', 'the', 'bats', 'were', 'unable', 'to', 'bat', 'away', 'the', 'flies', '.']


## Lemmatizer

A lemmatizer actually uses grammar: we can tell it which part of speech (POS) a word is, i.e. which role the word plays in the sentence. But for this we need to leave the sentence unprocessed.
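Why does the part of speech matter? The same surface form can have different lemmas depending on its role. The toy lookup table below illustrates this; the table entries and the helper `toy_lemmatize` are hand-written assumptions for this sketch and have nothing to do with how nltk or spaCy store their data.

```python
# Toy lookup table: (token, part of speech) -> lemma.
# The entries are hand-written for illustration only.
TOY_LEMMAS = {
    ("leaves", "NOUN"): "leaf",
    ("leaves", "VERB"): "leave",
    ("left", "VERB"): "leave",
    ("left", "ADJ"): "left",
}

def toy_lemmatize(token, pos):
    """Return the lemma for (token, pos), or the token itself if unknown."""
    return TOY_LEMMAS.get((token.lower(), pos), token)

print(toy_lemmatize("leaves", "NOUN"))  # as in "the leaves on a tree"
print(toy_lemmatize("leaves", "VERB"))  # as in "she leaves the room"
print(toy_lemmatize("left", "VERB"))    # as in "they left early"
```

Without the POS information, a lemmatizer has to guess one reading for "leaves", which is exactly what happens when we feed it isolated tokens.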

Alternatively, we can pass whole sentences and let the lemmatizer figure out the grammar by itself. Here is a first example of what the lemmatizer can figure out by itself.

In [5]:
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

rocks : rock
corpora : corpus


In [6]:
sentences=[tokens1,tokens2,tokens3,tokens4, tokens5]

for sentence in sentences:
    print(sentence)
    lemmatized_sentence = []
    for word in sentence:
        lemmatized_sentence.append(lemmatizer.lemmatize(word))
    print(" ".join(lemmatized_sentence))

['The', 'president', 'of', 'the', 'United', 'States', '(', 'US', ')', 'is', 'Joe', 'Biden', '.']
The president of the United States ( US ) is Joe Biden .
['Joe', 'Biden', 'the', 'president', 'wants', 'us', 'in', 'a', 'united', 'country', '.']
Joe Biden the president want u in a united country .
['Did', 'a', 'known', 'artist', 'paint', 'portraits', 'of', 'Joe', 'Biden', '?']
Did a known artist paint portrait of Joe Biden ?
['A', 'really', 'well-known', 'portrait', 'artist', 'is', 'Vincent', 'van', 'Gogh', '.']
A really well-known portrait artist is Vincent van Gogh .
['The', 'leaves', 'were', 'left', 'untouched', ',', 'and', 'the', 'bats', 'were', 'unable', 'to', 'bat', 'away', 'the', 'flies', '.']
The leaf were left untouched , and the bat were unable to bat away the fly .


This has very little impact on the text. However, lemmatizing becomes much more powerful if we use the function of words in the text. We will now quickly switch to the package spaCy because it has part-of-speech (POS) tagging built in.

Below, spaCy is called as sp. We pass it the sentence and it generates an object saved under sp_text. spaCy is very powerful and has lots of functions. We will only use the part-of-speech tag, which is stored under word.pos_, and the lemma, which is stored under word.lemma_. For more, read: https://spacy.io/usage/spacy-101

In [7]:
sentences=[sentence1,sentence2,sentence3,sentence4, sentence5]

for sentence in sentences:
    print(sentence)
    sp_text=sp(sentence)
    lemmatized_sentence = []
    for word in sp_text:
        print(word, word.pos_)
        lemmatized_sentence.append(word.lemma_)
    print(" ".join(lemmatized_sentence))
    print(" ")


The president of the United States (US) is Joe Biden.
The DET
president NOUN
of ADP
the DET
United PROPN
States PROPN
( PUNCT
US PROPN
) PUNCT
is AUX
Joe PROPN
Biden PROPN
. PUNCT
the president of the United States ( US ) be Joe Biden .
 
Joe Biden the president wants us in a united country.
Joe PROPN
Biden PROPN
the DET
president PROPN
wants VERB
us PRON
in ADP
a DET
united PROPN
country NOUN
. PUNCT
Joe Biden the president want we in a united country .
 
Did a known artist paint portraits of Joe Biden?
Did VERB
a DET
known ADJ
artist NOUN
paint NOUN
portraits NOUN
of ADP
Joe PROPN
Biden PROPN
? PUNCT
do a known artist paint portrait of Joe Biden ?
 
A really well-known portrait artist is Vincent van Gogh.
A DET
really ADV
well ADV
- PUNCT
known VERB
portrait NOUN
artist NOUN
is AUX
Vincent PROPN
van PROPN
Gogh PROPN
. PUNCT
a really well - know portrait artist be Vincent van Gogh .
 
The leaves were left untouched, and the bats were unable to bat away the flies.
The DET
leaves NOUN
w

## Remove punctuation and unify, e.g. convert to lower case

We now start back from the tokenized text as if we did not decide to do lemmatizing. Obviously, this is not what you want to do if lemmatizing is part of your pipeline.

### Tokenize again

In [8]:
for sentence in sentences:
    print(sentence)
    words = sentence.split()
    print(words)

The president of the United States (US) is Joe Biden.
['The', 'president', 'of', 'the', 'United', 'States', '(US)', 'is', 'Joe', 'Biden.']
Joe Biden the president wants us in a united country.
['Joe', 'Biden', 'the', 'president', 'wants', 'us', 'in', 'a', 'united', 'country.']
Did a known artist paint portraits of Joe Biden?
['Did', 'a', 'known', 'artist', 'paint', 'portraits', 'of', 'Joe', 'Biden?']
A really well-known portrait artist is Vincent van Gogh.
['A', 'really', 'well-known', 'portrait', 'artist', 'is', 'Vincent', 'van', 'Gogh.']
The leaves were left untouched, and the bats were unable to bat away the flies.
['The', 'leaves', 'were', 'left', 'untouched,', 'and', 'the', 'bats', 'were', 'unable', 'to', 'bat', 'away', 'the', 'flies.']


### Unify 1 - getting rid of non-words


The regular expression below, re.split(r'\W+', sentence), splits the input sentence into words, using any sequence of non-word characters (spaces, punctuation, etc.) as the delimiter. This is useful for pre-processing because it separates out the individual words for later counting. Note that a delimiter at the very start or end of the sentence leaves an empty string '' in the result, as you can see in the output below.

In [9]:
for sentence in sentences:
    print(sentence)
    words = re.split(r'\W+', sentence)
    print(words)

The president of the United States (US) is Joe Biden.
['The', 'president', 'of', 'the', 'United', 'States', 'US', 'is', 'Joe', 'Biden', '']
Joe Biden the president wants us in a united country.
['Joe', 'Biden', 'the', 'president', 'wants', 'us', 'in', 'a', 'united', 'country', '']
Did a known artist paint portraits of Joe Biden?
['Did', 'a', 'known', 'artist', 'paint', 'portraits', 'of', 'Joe', 'Biden', '']
A really well-known portrait artist is Vincent van Gogh.
['A', 'really', 'well', 'known', 'portrait', 'artist', 'is', 'Vincent', 'van', 'Gogh', '']
The leaves were left untouched, and the bats were unable to bat away the flies.
['The', 'leaves', 'were', 'left', 'untouched', 'and', 'the', 'bats', 'were', 'unable', 'to', 'bat', 'away', 'the', 'flies', '']
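Every list above ends with an empty string because the sentence-final punctuation acts as a delimiter with nothing after it. The following sketch shows two possible ways of dealing with this, filtering the result or using `re.findall` instead; the example sentence is made up for the illustration.

```python
import re

sentence = "(US) is Joe Biden."

# Splitting leaves '' at both ends: "(" opens and "." closes the string.
raw = re.split(r"\W+", sentence)

# Option 1: drop the empty strings after splitting.
words = [w for w in raw if w]

# Option 2: collect the word-character runs directly.
alt = re.findall(r"\w+", sentence)

print(raw)
print(words)
print(alt)
```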


### Unify 2 - more elegant

In re.sub(r'\W+', '', word) below, "sub" stands for "substitute". It is a function in Python's regular expression (re) module that replaces all occurrences of a pattern in a string with another string. Here, the pattern \W+ (which matches any sequence of non-word characters) is replaced with the empty string '', which removes these characters from the input word.

Why the part "if strip(w)" in the expression [strip(w) for w in words if strip(w)]? Because it only lets in elements that are not empty. An empty element arises when a whitespace-separated chunk consists only of non-word characters, for example a stand-alone dash.

In [10]:
#let's write our own function to do this
def strip(word):
    mod_string = re.sub(r'\W+', '', word)
    return mod_string

for sentence in sentences:
    words = sentence.split()
    stripped = [strip(w) for w in words if strip(w)]
    print(stripped)

['The', 'president', 'of', 'the', 'United', 'States', 'US', 'is', 'Joe', 'Biden']
['Joe', 'Biden', 'the', 'president', 'wants', 'us', 'in', 'a', 'united', 'country']
['Did', 'a', 'known', 'artist', 'paint', 'portraits', 'of', 'Joe', 'Biden']
['A', 'really', 'wellknown', 'portrait', 'artist', 'is', 'Vincent', 'van', 'Gogh']
['The', 'leaves', 'were', 'left', 'untouched', 'and', 'the', 'bats', 'were', 'unable', 'to', 'bat', 'away', 'the', 'flies']
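None of our five sentences actually produces an empty token, so here is a small sketch with a made-up input where the filter matters. It redefines `strip` as above to stay self-contained and, as one possible variation, uses the walrus operator (Python 3.8+) so that `strip` is called only once per word.

```python
import re

def strip(word):
    """Remove every non-word character from a token."""
    return re.sub(r"\W+", "", word)

# Made-up input with stand-alone punctuation chunks.
words = "well - known artist ?".split()

unfiltered = [strip(w) for w in words]
# The walrus operator stores the stripped token, so strip() runs once per word
# and empty results are skipped.
filtered = [s for w in words if (s := strip(w))]

print(unfiltered)
print(filtered)
```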


### Unify 3 - add lowercasing

Lowercasing is super easy to integrate in this pipeline.

In [11]:
for sentence in sentences:
    words = sentence.split()
    lowered = [strip(w).lower() for w in words]
    print(lowered)

['the', 'president', 'of', 'the', 'united', 'states', 'us', 'is', 'joe', 'biden']
['joe', 'biden', 'the', 'president', 'wants', 'us', 'in', 'a', 'united', 'country']
['did', 'a', 'known', 'artist', 'paint', 'portraits', 'of', 'joe', 'biden']
['a', 'really', 'wellknown', 'portrait', 'artist', 'is', 'vincent', 'van', 'gogh']
['the', 'leaves', 'were', 'left', 'untouched', 'and', 'the', 'bats', 'were', 'unable', 'to', 'bat', 'away', 'the', 'flies']


### Do you see a problem?

There is a problem with the code above. Depending on your application, you may not like what it does.

What does the code below do? (remove period to check)

### Unify 4 - special lowercasing

In [12]:
def abbr_or_lower(word):
    if re.match('([A-Z]+[a-z]*){2,}', word):
        return word
    else:
        return word.lower()

corpus=[]
for sentence in sentences:
    words = sentence.split()
    lowered = [abbr_or_lower(strip(w)) for w in words]
    print(lowered)
    corpus.append(lowered)

['the', 'president', 'of', 'the', 'united', 'states', 'US', 'is', 'joe', 'biden']
['joe', 'biden', 'the', 'president', 'wants', 'us', 'in', 'a', 'united', 'country']
['did', 'a', 'known', 'artist', 'paint', 'portraits', 'of', 'joe', 'biden']
['a', 'really', 'wellknown', 'portrait', 'artist', 'is', 'vincent', 'van', 'gogh']
['the', 'leaves', 'were', 'left', 'untouched', 'and', 'the', 'bats', 'were', 'unable', 'to', 'bat', 'away', 'the', 'flies']
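To get a feeling for what the regular expression in abbr_or_lower() keeps and what it lowercases, it can help to probe it with a few words. The sketch below repeats the function so that it runs on its own; the test words are made up for this illustration.

```python
import re

def abbr_or_lower(word):
    # Same rule as above: keep the word if it starts with at least two
    # "uppercase run + optional lowercase run" groups, otherwise lowercase it.
    if re.match(r"([A-Z]+[a-z]*){2,}", word):
        return word
    return word.lower()

examples = ["US", "United", "NASA", "McDonald", "A", "iPhone"]
results = {w: abbr_or_lower(w) for w in examples}
for word, result in results.items():
    print(f"{word:10} -> {result}")
```

Note that the rule is a heuristic: it protects abbreviations such as "US", but it also leaves mixed-case names such as "McDonald" untouched, which may or may not be what you want.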


### Apply the unification after Lemmatization

We return to the POS plus lemmatization pipeline above and now add the new tricks after this.

In [13]:
#we use the exact same loop but now collect in a list
corpus_pos_lemm=[]
for sentence in sentences:
    sp_text=sp(sentence)
    lemmatized_sentence = []
    for word in sp_text:
        lemmatized_sentence.append(word.lemma_)
    # the condition "if abbr_or_lower(strip(w))" gets rid of empties ''
    lowered = [abbr_or_lower(strip(w)) for w in lemmatized_sentence if abbr_or_lower(strip(w))]
    print(lowered)
    corpus_pos_lemm.append(lowered)


['the', 'president', 'of', 'the', 'united', 'states', 'US', 'be', 'joe', 'biden']
['joe', 'biden', 'the', 'president', 'want', 'we', 'in', 'a', 'united', 'country']
['do', 'a', 'known', 'artist', 'paint', 'portrait', 'of', 'joe', 'biden']
['a', 'really', 'well', 'know', 'portrait', 'artist', 'be', 'vincent', 'van', 'gogh']
['the', 'leave', 'be', 'leave', 'untouched', 'and', 'the', 'bat', 'be', 'unable', 'to', 'bat', 'away', 'the', 'fly']


## Stopwords

This is a very interesting concept. Stopwords are words that we agree are generally not meaningful for distinguishing one document from another.

Note: the idea of stopwords is closely linked to the bag-of-words model where grammar doesn't play a role.

In [14]:
stop_set = set(stopwords.words('english'))  # build the set once, not once per token
corpus_stop=[]
for sentence in sentences:
    words = sentence.split()
    lowered = [abbr_or_lower(strip(w)) for w in words if abbr_or_lower(strip(w)) not in stop_set]
    print(lowered)
    corpus_stop.append(lowered)


['president', 'united', 'states', 'US', 'joe', 'biden']
['joe', 'biden', 'president', 'wants', 'us', 'united', 'country']
['known', 'artist', 'paint', 'portraits', 'joe', 'biden']
['really', 'wellknown', 'portrait', 'artist', 'vincent', 'van', 'gogh']
['leaves', 'left', 'untouched', 'bats', 'unable', 'bat', 'away', 'flies']
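Stopword removal is nothing more than a membership test against a fixed set of words. The sketch below shows the mechanics with a tiny hand-picked set; `TOY_STOPWORDS` and `remove_stopwords` are names invented for this illustration, and the real NLTK English list is much longer.

```python
# A tiny hand-picked stopword set (an assumption for this sketch,
# not the NLTK list).
TOY_STOPWORDS = frozenset({"the", "of", "is", "a", "in", "and", "to", "were", "did"})

def remove_stopwords(tokens, stop_set=TOY_STOPWORDS):
    """Keep only the tokens that are not in stop_set."""
    return [t for t in tokens if t not in stop_set]

doc = ["the", "president", "of", "the", "united", "states", "US", "is", "joe", "biden"]
kept = remove_stopwords(doc)
print(kept)
```

Because the set is lowercase, the check is case sensitive: "US" survives here, while a lowercased "us" would only be dropped if it were in the set.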


## Stemming

"Stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language."

nltk allows you to choose between several stemmers. PorterStemmer is the oldest one, originally developed in 1979. LancasterStemmer was developed in 1990 and uses a more aggressive approach than the Porter stemming algorithm. Then we have the SnowballStemmer, which is available for several languages.

For Spanish we use the SnowballStemmer.

Stemming is standard in text mining, but it does not always make sense. In my experience, whether it is worth it really depends on the application. One important aspect is that stemming can be quite brutal, leading to very harsh cuts. It can be useful if dimensionality reduction is an important concern.
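To see why stems are often not valid words, consider the crudest possible stemmer: chop off a known suffix. The sketch below is a deliberately naive toy and is not the Porter, Lancaster or Snowball algorithm; the suffix list and the minimum stem length are arbitrary assumptions.

```python
# Suffixes are tried in this order; the first match is removed.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem=3):
    """Lowercase the word and strip one suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["friends", "leaves", "flies", "untouched", "is"]:
    print(f"{w:12} -> {naive_stem(w)}")
```

Real stemmers apply many more rules and conditions, but the principle is the same: rules operate on the characters of a word, not on its meaning.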

In [15]:
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

#english and spanish language
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
lancaster=LancasterStemmer()
englishStemmer=SnowballStemmer("english")
spanishStemmer=SnowballStemmer("spanish")

In [16]:
#A list of words to be stemmed
word_list = ["friend", "friendship", "friends", "friendships","stable","stabilize","stabilized", "misunderstanding","footballers", "fighter", "fought"]
print("{0:20}{1:20}{2:20}{3:30}".format("Word","Porter Stemmer","Lancaster Stemmer", "Snowball Stemmer"))
for word in word_list:
    print("{0:20}{1:20}{2:20}{3:30}".format(word,porter.stem(word),lancaster.stem(word), englishStemmer.stem(word)))

Word                Porter Stemmer      Lancaster Stemmer   Snowball Stemmer              
friend              friend              friend              friend                        
friendship          friendship          friend              friendship                    
friends             friend              friend              friend                        
friendships         friendship          friend              friendship                    
stable              stabl               stabl               stabl                         
stabilize           stabil              stabl               stabil                        
stabilized          stabil              stabl               stabil                        
misunderstanding    misunderstand       misunderstand       misunderstand                 
footballers         footbal             footbal             footbal                       
fighter             fighter             fight               fighter                       

In [17]:
#A spanish list of words to be stemmed
word_list = ["amigo", "amigos", "amigas", "correr","corriente","incertidumbre","luchadores", "desempleo","fueron"]
print("{0:20}{1:20}".format("Word","Snowball Stemmer"))
for word in word_list:
    print("{0:20}{1:20}".format(word,spanishStemmer.stem(word)))

Word                Snowball Stemmer    
amigo               amig                
amigos              amig                
amigas              amig                
correr              corr                
corriente           corrient            
incertidumbre       incertidumbr        
luchadores          luchador            
desempleo           desemple            
fueron              fueron              


#### Bringing stemming back to the pipeline

Note that the stemmer lowercases its input but does not strip punctuation. I therefore dropped the abbr_or_lower() function but left strip(w) in place.

In [18]:
corpus_stem=[]
for sentence in sentences:
    words = sentence.split()
    lowered = [englishStemmer.stem(strip(w)) for w in words]
    print(lowered)
    corpus_stem.append(lowered)

['the', 'presid', 'of', 'the', 'unit', 'state', 'us', 'is', 'joe', 'biden']
['joe', 'biden', 'the', 'presid', 'want', 'us', 'in', 'a', 'unit', 'countri']
['did', 'a', 'known', 'artist', 'paint', 'portrait', 'of', 'joe', 'biden']
['a', 'realli', 'wellknown', 'portrait', 'artist', 'is', 'vincent', 'van', 'gogh']
['the', 'leav', 'were', 'left', 'untouch', 'and', 'the', 'bat', 'were', 'unabl', 'to', 'bat', 'away', 'the', 'fli']


## Vectorization

Now comes the most important step in the pipeline. We will represent the sentences as vectors, through a count of their terms. For now we will do this in the simplest possible form, without dropping anything and focusing on unigrams.

Conceptually important here is the fact that we want to construct a dictionary of how positions in the vector (the index) relate to terms. But we can then forget about the meaning of the index and just treat the documents as vectors of numbers. This is wonderful as it allows us to do all the cool stuff that we can do with vectors.

The following code goes through the corpus that we made above. Let's try different ways of making the corpus and think about the vectors this generates.

### Start with lowercased corpus to build vocabulary

In [19]:
vocab, index = {}, 1  # start indexing from 1
for doc in corpus:
    for token in doc:
      if token not in vocab:
        vocab[token] = index
        index += 1

vocab_lower=vocab
vocab_lower_size = len(vocab)
print(vocab_lower)
print(" ")
print("Total size of vocabulary is:", vocab_lower_size)

{'the': 1, 'president': 2, 'of': 3, 'united': 4, 'states': 5, 'US': 6, 'is': 7, 'joe': 8, 'biden': 9, 'wants': 10, 'us': 11, 'in': 12, 'a': 13, 'country': 14, 'did': 15, 'known': 16, 'artist': 17, 'paint': 18, 'portraits': 19, 'really': 20, 'wellknown': 21, 'portrait': 22, 'vincent': 23, 'van': 24, 'gogh': 25, 'leaves': 26, 'were': 27, 'left': 28, 'untouched': 29, 'and': 30, 'bats': 31, 'unable': 32, 'to': 33, 'bat': 34, 'away': 35, 'flies': 36}
 
Total size of vocabulary is: 36
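Since the dictionary maps terms to positions, it is often handy to also have the reverse direction, from position back to term. The sketch below builds both on a made-up two-document corpus; `index_to_term` is a name chosen for this illustration, and `dict.setdefault` is just one possible alternative to the explicit `if token not in vocab` check.

```python
docs = [
    ["joe", "biden", "the", "president"],
    ["the", "president", "wants", "us"],
]

vocab = {}
for doc in docs:
    for token in doc:
        # setdefault only stores the new index the first time a token is seen.
        vocab.setdefault(token, len(vocab) + 1)  # indices start at 1, as above

# Inverse mapping: index -> term, useful for reading vectors back as words.
index_to_term = {index: term for term, index in vocab.items()}

print(vocab)
print(index_to_term)
```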


### Stopword removed corpus

Check out the total size of the vocabulary below. Why do we lose dimensions compared to the lowercased version?

In [20]:
vocab, index = {}, 1  # start indexing from 1
for doc in corpus_stop:
    for token in doc:
      if token not in vocab:
        vocab[token] = index
        index += 1

vocab_stop=vocab
vocab_stop_size = len(vocab)
print(vocab_stop)
print(" ")
print("Total size of vocabulary with stopwords removed is:", vocab_stop_size)

{'president': 1, 'united': 2, 'states': 3, 'US': 4, 'joe': 5, 'biden': 6, 'wants': 7, 'us': 8, 'country': 9, 'known': 10, 'artist': 11, 'paint': 12, 'portraits': 13, 'really': 14, 'wellknown': 15, 'portrait': 16, 'vincent': 17, 'van': 18, 'gogh': 19, 'leaves': 20, 'left': 21, 'untouched': 22, 'bats': 23, 'unable': 24, 'bat': 25, 'away': 26, 'flies': 27}
 
Total size of vocabulary with stopwords removed is: 27


### Stemmed corpus

Why do we lose dimensions compared to the lowercased version?

In [21]:
vocab, index = {}, 1  # start indexing from 1
for doc in corpus_stem:
    for token in doc:
      if token not in vocab:
        vocab[token] = index
        index += 1

vocab_stem=vocab
vocab_stem_size = len(vocab)
print(vocab_stem)
print(" ")
print("Total size of stemmed vocabulary:", vocab_stem_size)

{'the': 1, 'presid': 2, 'of': 3, 'unit': 4, 'state': 5, 'us': 6, 'is': 7, 'joe': 8, 'biden': 9, 'want': 10, 'in': 11, 'a': 12, 'countri': 13, 'did': 14, 'known': 15, 'artist': 16, 'paint': 17, 'portrait': 18, 'realli': 19, 'wellknown': 20, 'vincent': 21, 'van': 22, 'gogh': 23, 'leav': 24, 'were': 25, 'left': 26, 'untouch': 27, 'and': 28, 'bat': 29, 'unabl': 30, 'to': 31, 'away': 32, 'fli': 33}
 
Total size of stemmed vocabulary: 33


### Lemmatize then unified

Why do we lose dimensions compared to the lowercased version?

In [22]:
vocab, index = {}, 1  # start indexing from 1
for doc in corpus_pos_lemm:
    for token in doc:
      if token not in vocab:
        vocab[token] = index
        index += 1

vocab_pos_lemm=vocab
vocab_pos_lemm_size = len(vocab)
print(vocab_pos_lemm)
print(" ")
print("Total size of lemmatized vocabulary:", vocab_pos_lemm_size)

{'the': 1, 'president': 2, 'of': 3, 'united': 4, 'states': 5, 'US': 6, 'be': 7, 'joe': 8, 'biden': 9, 'want': 10, 'we': 11, 'in': 12, 'a': 13, 'country': 14, 'do': 15, 'known': 16, 'artist': 17, 'paint': 18, 'portrait': 19, 'really': 20, 'well': 21, 'know': 22, 'vincent': 23, 'van': 24, 'gogh': 25, 'leave': 26, 'untouched': 27, 'and': 28, 'bat': 29, 'unable': 30, 'to': 31, 'away': 32, 'fly': 33}
 
Total size of lemmatized vocabulary: 33


### Moving on. Let's vectorize our documents using the lowercased vocabulary.

Let's run a count of the terms in vocab_lower on the fifth (last) lowercased sentence, corpus[4].

Make sure you understand what the following cell does.

In [23]:
for w in vocab_lower:
    print(w, corpus[4].count(w))

the 3
president 0
of 0
united 0
states 0
US 0
is 0
joe 0
biden 0
wants 0
us 0
in 0
a 0
country 0
did 0
known 0
artist 0
paint 0
portraits 0
really 0
wellknown 0
portrait 0
vincent 0
van 0
gogh 0
leaves 1
were 2
left 1
untouched 1
and 1
bats 1
unable 1
to 1
bat 1
away 1
flies 1


We will write a function that uses this. Make sure you understand what the following function does.

In [24]:
def vectorize(tokens, vocab):
    vector=[]
    for w in vocab:
        vector.append(tokens.count(w))
    return vector
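The function above scans the whole token list once for every term in the vocabulary. For larger corpora, one possible alternative is to count all tokens in a single pass with `collections.Counter`. The sketch below uses a made-up mini vocabulary; `vectorize_counter` is a name invented for this illustration.

```python
from collections import Counter

def vectorize_counter(tokens, vocab):
    """Count each vocabulary term in tokens with a single pass over tokens."""
    counts = Counter(tokens)
    # Iterating over a dict yields its keys in insertion order (Python 3.7+),
    # so the positions match those produced by vectorize().
    # A Counter returns 0 for terms it has never seen.
    return [counts[term] for term in vocab]

vocab = {"the": 1, "bat": 2, "were": 3, "joe": 4}
tokens = ["the", "bats", "were", "unable", "to", "bat", "away", "the", "flies"]
vector = vectorize_counter(tokens, vocab)
print(vector)
```

Tokens that are not in the vocabulary (here "bats", "unable", ...) are simply ignored, just as in vectorize().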

Now we apply this to the entire lowercased corpus, document by document.

In [25]:
vectors=[]
for doc in corpus:
    vectors.append(vectorize(doc, vocab_lower))


Let's visualize the collection of vectors that we generated from the documents by counting the terms in the dictionary.

In [26]:
#making a pandas dataframe
df = pd.DataFrame(vectors)
df

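|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 |
|---|---|---|---|---|---|---|---|---|---|---|-----|----|----|----|----|----|----|----|----|----|----|
| 0 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |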


The above is a huge step. This is a mathematical vector representation of the corpus! This is how machines in the bag-of-words model see language.

## What is this animal called?

The name of this matrix has been mentioned above and in the lecture.

This is a key item we will come back to again and again - it is called the...

### What are the correct column names here?
For the machine, the columns of the document-term matrix are index numbers. But for humans it is useful to think about what the columns represent. The vocab_lower dictionary has them stored. What will they be?

Think before removing the period.

In [27]:
df.columns = vocab_lower.keys()
df
.

SyntaxError: invalid syntax (<ipython-input-27-db6bbb118ee0>, line 3)

### Distances between texts

Let's have some quick fun with the matrix. We will transpose the document-term matrix and then look at the column vectors of this new matrix.

Note that each column now is a representation of a document. As they are vectors we can now calculate the distances between the different sentences as shown in the lecture.


Remember the sentences were:

sentence1 = "The president of the United States (US) is Joe Biden."

sentence2 = "Joe Biden the president wants us in a united country."

sentence3 = "Did a known artist paint portraits of Joe Biden?"

sentence4 = "A really well-known portrait artist is Vincent van Gogh."

sentence5 = "The leaves were left untouched, and the bats were unable to bat away the flies."

Which sentences do you think are most similar?


In [28]:
df=df.transpose()
df


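|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 2 | 1 | 0 | 0 | 3 |
| 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 0 |
| 3 | 1 | 1 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 | 0 |
| 6 | 1 | 0 | 0 | 1 | 0 |
| 7 | 1 | 1 | 1 | 0 | 0 |
| 8 | 1 | 1 | 1 | 0 | 0 |
| 9 | 0 | 1 | 0 | 0 | 0 |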


In [29]:
#distance of vectors to the second sentence df[1]

for sentence in sentences:
    print(sentence)
print(" ")
for i in range(0,5):
    result=abs(df[1]-df[i])
    print("Difference of sentence",i+1, "to sentence 2 is", result.sum(axis=0))



The president of the United States (US) is Joe Biden.
Joe Biden the president wants us in a united country.
Did a known artist paint portraits of Joe Biden?
A really well-known portrait artist is Vincent van Gogh.
The leaves were left untouched, and the bats were unable to bat away the flies.
 
Difference of sentence 1 to sentence 2 is 10
Difference of sentence 2 to sentence 2 is 0
Difference of sentence 3 to sentence 2 is 13
Difference of sentence 4 to sentence 2 is 17
Difference of sentence 5 to sentence 2 is 23
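The measure computed above, the sum of absolute differences, is usually called the Manhattan (or L1) distance. The sketch below reproduces it for sentences 1 and 2 in plain Python and, for comparison, also computes the Euclidean distance; the function names are chosen for this illustration and the vectors are typed in by hand from the counts above.

```python
import math

def manhattan(u, v):
    """Sum of absolute differences (the measure computed above)."""
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Counts for sentences 1 and 2 over the 14 terms that occur in either one:
# the, president, of, united, states, US, is, joe, biden,
# wants, us, in, a, country
s1 = [2, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
s2 = [1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

print(manhattan(s1, s2))
print(euclidean(s1, s2))
```

Terms that are zero in both documents contribute nothing to either distance, which is why the remaining 22 columns can be left out here.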


# WOW THIS IS AMAZING - YOUR LIFE HAS CHANGED
(a tiny little bit)

# Exercise (difficult)

Write your own pipeline for processing a csv file into a document-term matrix using **lowercasing, stopword removal and stemming**. Write the pipeline such that you can apply the loop "for sentence in sentences:" above to a column of a pandas data frame. Use the pieces of the pipeline we discussed in class. Do not use other packages.