## Basic Definitions

* Sentence: sequence of words that begins with a capitalized word and ends with terminal punctuation such as a period, question mark or exclamation mark.

* Tokens: can be words, sub-units of words or punctuation marks

* Characters: the atomic symbols that compose a string

* Corpus: large collection of writings used for linguistic analysis

* N-Gram: N consecutive items, whether words, subwords or characters. Single tokens may be called unigrams.
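
To make the N-gram definition concrete, here is a minimal sketch that builds n-grams from a list of word tokens. The helper name `ngrams` and the example sentence are illustrative assumptions, not part of any library used in these notes.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I like cats a lot".split()
unigrams = ngrams(tokens, 1)  # one tuple per single token
bigrams = ngrams(tokens, 2)   # pairs of neighbouring tokens
print(bigrams)
# [('I', 'like'), ('like', 'cats'), ('cats', 'a'), ('a', 'lot')]
```

The same function works on characters or subwords, since it only assumes a sequence of items.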



## Bag of words:

Text is sequential and the order of words in a sentence matters for the meaning of the sentence. Some NLP approaches do not consider the order of, or relationships between, words in a sentence. These are called 'bag of words' approaches.

Consider the phrases "dog toy" vs "toy dog": 2 very different concepts. The bag of words approach cannot distinguish between them because both contain the same words with the same counts.
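
The dog toy vs toy dog example can be checked with a minimal sketch that represents each phrase as a dictionary of word counts. The helper name `bag_of_words` is an assumption made for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercased, whitespace-separated token occurs."""
    return Counter(text.lower().split())

bag_a = bag_of_words("dog toy")
bag_b = bag_of_words("toy dog")
print(bag_a == bag_b)  # True: word order is lost, so the two phrases look identical
```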



## Normalization of Vectors

Assume we create vectors by counting the occurrences of each vocabulary word in a document (count vectors). Some documents will be long while others will be short. As a result, our count vectors will differ in magnitude (more words = higher counts). To address this we need to **normalize** our vectors.

One way to normalize a vector is to divide each element by the vector's L2 norm, i.e. the square root of the sum of the squares of its elements.
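
As a rough sketch of L2 normalization, the snippet below normalizes two made-up count vectors, where the second is simply a ten times longer version of the first. The function name `l2_normalize` and the vectors are assumptions chosen for illustration.

```python
import math

def l2_normalize(vector):
    """Divide each element by the square root of the sum of squares."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)  # an all-zero vector has no direction to preserve
    return [x / norm for x in vector]

short_doc = [3, 4, 0]    # hypothetical counts for a short document
long_doc = [30, 40, 0]   # same word proportions, ten times longer

print(l2_normalize(short_doc))  # [0.6, 0.8, 0.0]
print(l2_normalize(long_doc))   # [0.6, 0.8, 0.0]
```

After normalization both documents map to the same unit-length vector, so document length no longer dominates the comparison.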





## Tokenization

In [3]:
s1 = "I like cats"
#To split the string I can call the string method split()
#By default the string function splits on whitespace.
s1.split()

['I', 'like', 'cats']

Punctuation can be important for downstream NLP tasks. For example "I hate cats." vs "I hate cats?" could mean 2 different things. Thus choosing the right tokenization strategy is important.

In [4]:
s2 = "I hate cats."
s3 = "I hate cats?"

# Print the tokens of s2 and s3 side by side, splitting both with str.split()
for t2, t3 in zip(s2.split(), s3.split()):
  print("s2: " + t2 + "|  s3: " + t3)

s2: I|  s3: I
s2: hate|  s3: hate
s2: cats.|  s3: cats?


Because the punctuation stays attached to the words, cats. and cats? are seen as 2 separate tokens. This inflates the vocabulary, so we need more data to learn that both refer to the same word.
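
One possible way around this is to split punctuation into its own tokens. The sketch below uses a simple regular expression. The helper name `tokenize` and the pattern are assumptions for illustration, and the pattern is deliberately crude (it would also break a contraction such as "don't" into three pieces).

```python
import re

def tokenize(text):
    # \w+ matches a run of word characters, [^\w\s] matches a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("I hate cats."))  # ['I', 'hate', 'cats', '.']
print(tokenize("I hate cats?"))  # ['I', 'hate', 'cats', '?']
```

Now both sentences share the token cats, and the punctuation is still available as a separate token.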

### Character based tokenization:

In English there are a limited number of characters, so the advantage is that our vocab size will be small.

The disadvantage is that individual characters carry much less information than words do.
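
A minimal sketch of character based tokenization on the earlier example string shows the trade-off: the vocabulary is tiny, but the sequence gets longer and each token says little on its own. The variable names are illustrative.

```python
s1 = "I like cats"

char_tokens = list(s1)            # every character, including spaces, is a token
vocab = sorted(set(char_tokens))  # the distinct characters seen

print(len(s1.split()))   # 3 word tokens
print(len(char_tokens))  # 11 character tokens
print(vocab)
# [' ', 'I', 'a', 'c', 'e', 'i', 'k', 'l', 's', 't']
```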

### Sub-word tokenization

A middle ground between word based and character based tokenization is sub-word tokenization. Consider the example *walking*, which can be decomposed into *walk* + *ing*. "Walk" is closely related to "walking", so we want our ML model to share some representation between them. *ing* should be seen as a modifier on the word *walk*. If we don't split *walking*, our model may see "walk" as being no closer to "walking" than it is to "eat" or "capture".

Although sub-word tokenization is a good middle ground between word based and character based tokenization, it's not necessary for a good model.
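
To illustrate the *walk* + *ing* idea, here is a toy sketch that splits a word on a small, hand-picked list of suffixes. This is only an illustration: the `SUFFIXES` list and the `split_subwords` helper are assumptions, and practical sub-word tokenizers learn their pieces from data rather than from a fixed list.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical, hand-picked suffix list

def split_subwords(word):
    """Split off the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            return [stem, suffix]
    return [word]

print(split_subwords("walking"))  # ['walk', 'ing']
print(split_subwords("walked"))   # ['walk', 'ed']
print(split_subwords("walk"))     # ['walk']
```

All three words now share the token walk, which is the shared representation described above.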

## Stop words

Common words such as "and", "it", "the", "a", "is" etc. don't necessarily carry a lot of information on their own.

One reason to remove stop words is that keeping them **increases dimensionality**. We prefer not to have high dimensional vectors because they require more computation and memory.

Another reason is that if we use a count vectorization strategy, documents will be clustered close together based on having a similar number of the stop words. This reduces the ability to distinguish between different concepts.
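
A minimal sketch of stop word removal, assuming a tiny hand-written stop list built from the examples above (real stop lists are much longer, and the names `STOP_WORDS` and `remove_stop_words` are illustrative):

```python
STOP_WORDS = {"and", "it", "the", "a", "is"}  # assumed list, for illustration only

def remove_stop_words(tokens):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "The cat is on the mat and it is happy".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['cat', 'on', 'mat', 'happy']
```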



## Stemming and Lemmatization

#### Stemming
Stemming chops off the end of a word using heuristic rules, e.g. running -> run. The result is not guaranteed to be a real word.


In [5]:
# The most common stemming algorithm is the Porter stemmer
from nltk.stem import PorterStemmer
porter = PorterStemmer()
porter.stem("walking")  # returns 'walk'
porter.stem("procrastinating")  # returns 'procrastin', which is not a real word

'procrastin'

In [6]:
import nltk

In [7]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()

In [8]:
porter.stem("walking")

'walk'

In [9]:
porter.stem("walked")

'walk'

In [10]:
porter.stem("ran") #Did not return the root word run

'ran'

In [11]:
porter.stem("replacement") #returns replac - not a real word

'replac'

In [12]:
sentence = "Lemmatization is more sophisticated than stemming"
words = sentence.split() #returns a list of individual words in the sentence
for word in words:
  print(porter.stem(word), end=" ")

lemmat is more sophist than stem 

#### Lemmatization

Lemmatization uses a dictionary to return the true root word (the lemma), e.g. swam -> swim, procrastinating -> procrastinate, better -> good, was -> be

In [13]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk

In [14]:
#download the wordnet dictionary
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [15]:
lemmatizer = WordNetLemmatizer()
wordlist = ["mice", "going", "better"]
# lemmatizer.lemmatize treats every word as a noun by default.
# "going" is a verb and "better" is an adjective, so they come back unchanged
# unless we pass the pos argument 'v' and 'a' respectively.
for word in wordlist:
  root = lemmatizer.lemmatize(word)
  print(root)

mouse
going
better


In [16]:
lemmatizer.lemmatize("walking")  # treated as a noun by default, so the word comes back unchanged

'walking'

In [17]:
lemmatizer.lemmatize("walking", pos="v")  # pos="v" asks for the root word of the verb "walking"

'walk'

In [18]:
lemmatizer.lemmatize("better", pos="a")  # pos="a" asks for the root word of the adjective "better"

'good'

### POS - Part of Speech and Lemmatization

The root form of a word depends on its POS. For example:

> "Donald Trump has a devoted following"

In this example, "following" is a noun, whereas in the example below:

> "The cat was following the bird as it flew by"

"Following" is a verb in this context.

In the first case, the root form of the word is "following" while in the second the root form of the word is "follow".


### Applications of Stemming and Lemmatization

- Search engines and document retrieval
- Online ads
- Social media tags

#### Search Engines

- When a user enters a query, we don't want to return only exact matches, because we would miss equally relevant results that contain a different form of the query words. By converting both the query and the documents to root forms, we get more matches.
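
The retrieval idea can be sketched as follows. To keep the example self-contained, a small hand-written lookup table (`ROOTS`) stands in for a real stemmer or lemmatizer. The table, the helper names and the documents are all assumptions made for illustration.

```python
# Hypothetical lookup table standing in for a real stemmer or lemmatizer
ROOTS = {"running": "run", "runs": "run", "ran": "run"}

def normalize(word):
    """Lowercase a word and map it to its root form if we know one."""
    word = word.lower()
    return ROOTS.get(word, word)

def search(query, documents):
    """Return the documents sharing at least one root form with the query."""
    query_roots = {normalize(w) for w in query.split()}
    return [doc for doc in documents
            if query_roots & {normalize(w) for w in doc.split()}]

documents = ["She runs every morning", "He ran a marathon", "Cats like to sleep"]
results = search("running", documents)
print(results)  # ['She runs every morning', 'He ran a marathon']
```

An exact string match on "running" would have returned nothing here, while matching on root forms finds both relevant documents.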

#### Online Advertising

- Ads are based on keywords. Ads need to be matched to the search terms, and advertisers must pay ad platforms each time their ad is shown to a user. Therefore it is important to show the ad only when the search term is relevant to the topic of the ad.

### POS Tagging

NLTK has a POS tagger. However, it returns Penn Treebank tags (e.g. 'NN', 'VBZ', 'JJ'), while the WordNet lemmatizer expects WordNet POS constants ('n', 'v', 'a', 'r'), so the tags have to be mapped from one format to the other.

In [19]:
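def get_wordnet_pos(treebank_tag: str) -> str:
  """Map a Penn Treebank tag (e.g. 'VBZ') to the WordNet POS constant the lemmatizer expects."""
  if treebank_tag.startswith('J'):
    return wordnet.ADJ
  elif treebank_tag.startswith('V'):
    return wordnet.VERB
  elif treebank_tag.startswith('N'):
    return wordnet.NOUN
  elif treebank_tag.startswith('R'):
    return wordnet.ADV
  else:
    # Any other tag (DT, IN, PRP, ...) falls back to the lemmatizer's default: noun
    return wordnet.NOUN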


In [21]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [22]:
words = "Donald Trump has a devoted following".split()


In [25]:
#run a POS tagger

words_and_tags = nltk.pos_tag(words) #returns a list of tuples containing the word and corresponding tag
words_and_tags

[('Donald', 'NNP'),
 ('Trump', 'NNP'),
 ('has', 'VBZ'),
 ('a', 'DT'),
 ('devoted', 'VBN'),
 ('following', 'NN')]

In [33]:
for word,tag in words_and_tags:
  lemma = lemmatizer.lemmatize(word, pos = get_wordnet_pos(tag))
  print(lemma, end = " ")


Donald Trump have a devote following 

In [34]:
words_2 = "The cat was following the bird as it flew by".split()

In [42]:
words_and_tags = nltk.pos_tag(words_2)
words_and_tags

[('The', 'DT'),
 ('cat', 'NN'),
 ('was', 'VBD'),
 ('following', 'VBG'),
 ('the', 'DT'),
 ('bird', 'NN'),
 ('as', 'IN'),
 ('it', 'PRP'),
 ('flew', 'VBD'),
 ('by', 'IN')]

In [40]:
for word, tags in words_and_tags:
  lemma = lemmatizer.lemmatize(word, pos = get_wordnet_pos(tags))
  print(lemma, end = " ")

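The cat be follow the bird a it fly by 

Note that "as" came out as "a": its tag IN falls through to the noun default in get_wordnet_pos, and the lemmatizer then strips the trailing "s" as if the word were a plural noun.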