## Vector Models and Text Processing
### By Zaky Riyadi
- To convert text into numbers

- **Sentence**: a sequence of words that typically starts with a capitalized word and ends with punctuation.
- **Token**: can refer to words, punctuation, or sub-units of words (a sentence is therefore a sequence of tokens).
- **Tokenization**: a way of separating a piece of text into smaller units called tokens. Tokens can be words, characters, or subwords (character n-grams).
- **Characters**: commonly refers to letters, punctuation, and whitespace.
- **Vocabulary**: the set of "all" words.
- **Corpus**: A large collection of writings of a specific kind or on a specific subject.
- **N-Gram**: 
    - Refers to N consecutive items (e.g: words, subwords, characters).
    - One item: unigram; two items: bigram; three items: trigram.
    - Use cases: word2vec, Markov models.
- What are **vectors**?
    - Vectors are how we represent text numerically, and they are the foundation of ML techniques.
    - Conventional definition: a quantity that has both magnitude and direction.
    - ML definition: an array of scalars.
    - In practice, think of a vector as a list of numbers, and of vector algebra as operations performed on the numbers in the list.
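
The N-gram definition above can be sketched in a few lines of Python. This is only an illustration: the helper name `ngrams` and the sample sentence are assumptions made for the example, not part of any library used in these notes.

```python
def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

tokens = "the quick brown fox".split()

unigrams = ngrams(tokens, 1)     # word unigrams
bigrams = ngrams(tokens, 2)      # word bigrams
trigrams = ngrams(tokens, 3)     # word trigrams
char_bigrams = ngrams("fox", 2)  # character bigrams of a single word

print(bigrams)
print(char_bigrams)
```

The same function works for words and characters because both a list of tokens and a string are sequences that support slicing.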
    
- **Bag of words (BOW)**
    - Text is sequential, and the order of the words gives the text its meaning. E.g. when the words are shuffled, the meaning changes or the text becomes incomprehensible.
    - Despite this, many NLP methods do **not consider the word order**; this approach is called **bag of words**.
    - BOW is commonly used by vector models and classic ML models.
    - Probabilistic models (e.g. Markov models) and DL models commonly do not use the BOW approach.
 
- **Count Vectorizer** 
    - A simple approach to converting text into vectors using the BOW approach.
    - Practical issues:
        - Text is represented in a computer as a string, but the algorithm needs to count individual words. --> **Tokenization**
        - **Tokenization**: convert a string (containing multiple words) into a list of words (each of which is a string).
        - How do we know which word corresponds to which position in the vector? --> **Mapping from word to index**
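
The two practical issues above (tokenization, then a fixed word position per vector component) can be sketched with the standard library alone. This is a simplified illustration and not how SKLearn implements `CountVectorizer`; the sample documents and the helper name `count_vector` are assumptions.

```python
from collections import Counter

docs = ["I like cats", "I like dogs and cats like me"]

# Tokenization: string -> list of word strings (lowercased, split on whitespace)
tokenized = [doc.lower().split() for doc in docs]

# Fixed vocabulary order, so each word always lands in the same vector position
vocab = sorted({token for tokens in tokenized for token in tokens})

def count_vector(tokens):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [count_vector(tokens) for tokens in tokenized]

print(vocab)
for vector in vectors:
    print(vector)
```

Because only counts are kept, shuffling the words of a document gives exactly the same vector, which is the bag-of-words property described earlier.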

- **Tokenization**
    - Similar to the string split function `.split()`.
- A few things to consider:
    - **Punctuation** may be important for downstream NLP tasks (e.g. "I hate cats." vs. "I hate cats?"). Whether punctuation matters depends on the objective. Note that if punctuation is not separated or removed, the dimensionality grows (e.g. "cats." and "cats?" become 2 different tokens).
    - SKLearn's CountVectorizer ignores punctuation by default.
    - **Casing**: consider sentiment analysis or spam detection. E.g. does "cat" have the same meaning as "Cat"?
        - Use the `str.lower()` method in Python or `CountVectorizer(lowercase=True)` in SKLearn (`lowercase=True` is the default).
    - **Accents**: less common in English.
        - Use `CountVectorizer(strip_accents="ascii")` or `strip_accents="unicode"` (the default is `None`).
    - **Word-based tokenization**: a word is more meaningful than a character (e.g. cats, dogs, etc.). However, a large vocabulary takes a lot of space, and DL model accuracy may suffer because the model has to produce a probability distribution over every word in the vocabulary.
    - **Character-based tokenization**: there are only 26 letters (plus characters for whitespace and punctuation), so it takes little space and is easy to represent in a computer. However, a character carries less meaning than a word.
    - **Subword-based tokenization**: a middle ground between word-based and character-based. E.g. "walking" -> "walk" + "ing". "walk" is closely related to "walking" and can be represented by the token "walk", while "ing" can be shared by "walking", "eating", etc. However, more data is needed to let the model learn that "walk" and "walking" are similar.
    - SKLearn tokenization: word-based = `CountVectorizer(analyzer="word")` (**default**), character-based = `CountVectorizer(analyzer="char")`.
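
To make the punctuation and casing points concrete, here is a small standard-library sketch that compares three ways of tokenizing the same text. The regular expression is the pattern documented as CountVectorizer's default `token_pattern`; the sample text and variable names are assumptions for illustration.

```python
import re

text = "I hate cats. Do I hate cats?"

# Plain whitespace split: punctuation stays glued to the words
whitespace_tokens = text.split()

# Lowercase, then keep runs of 2 or more word characters
# (punctuation and single-character words are dropped)
word_tokens = re.findall(r"\b\w\w+\b", text.lower())

# Character-based: every character, including spaces, is a token
char_tokens = list(text.lower())

print(whitespace_tokens)
print(word_tokens)
print(sorted(set(char_tokens)))
```

With the whitespace split, "cats." and "cats?" count as two different vocabulary entries, while the word pattern maps both to "cats" at the cost of losing the question mark.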
 
- **Stopwords**: 
    - Commonly used words that carry very little useful information, e.g. "a", "the", "is", "and", "are". It is often better to ignore them to reduce the dimensionality.
    - In SKLearn, set `CountVectorizer(stop_words="english")` or pass a user-defined list of words. The default is `None`.
    - The NLTK library also provides stopword lists for several languages (English, Arabic, German, ...).
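
A minimal sketch of stopword removal, assuming a tiny hand-picked stopword set (real lists, such as the ones shipped with SKLearn or NLTK, are much longer):

```python
# Hypothetical, deliberately tiny stopword set for illustration
stop_words = {"a", "the", "is", "and", "are"}

tokens = "the cat and the dog are friends".split()
filtered = [token for token in tokens if token not in stop_words]

print(filtered)
print(len(set(tokens)), "->", len(set(filtered)), "distinct words")
```

The vocabulary shrinks while the content words survive, which is the dimensionality reduction the notes refer to.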

- **Stemming and Lemmatization**:
    - Using basic word tokenization, each variation of a word has its own vector component, e.g. "walk", "walking", "walks", "walked".
    - This can lead to high-dimensional vectors. The solution is to convert words to their root word.
    - 2 common approaches are stemming (chops off the end of the word, which can be inaccurate) and lemmatization (uses the actual rules of the language and returns the true root word).
    - **Stemming eg**:
      ```
      from nltk.stem import PorterStemmer
      
      porter = PorterStemmer()
      porter.stem("talking") # returns 'talk'
      ```
    - **Lemmatization**:
      ```
      import nltk
      from nltk.stem import WordNetLemmatizer
      from nltk.corpus import wordnet

      nltk.download("wordnet") # only need to do once

      lemmatizer = WordNetLemmatizer()
      lemmatizer.lemmatize("mice") # returns 'mouse'
      lemmatizer.lemmatize("going") # returns 'going'
      lemmatizer.lemmatize("going", pos=wordnet.VERB) # returns 'go', pos = "part of speech"
      ```
      - To use the lemmatizer properly, POS tagging is required.
      - A tag mapping is required to make the tags from NLTK's POS tagger compatible with WordNet.
    - Stemming/lemmatization are used in search engines, document retrieval, online ads, and social media tags.

In [1]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer # import stemming and lemmatizer
# Stemming
porter = PorterStemmer()
tuple(porter.stem(word) for word in ["walking", "walked", "ran", "running", "bosses", "replacement"])

('walk', 'walk', 'ran', 'run', 'boss', 'replac')

In [2]:
sentence = "Lemmatization is more sophisticated than stemming".split() # split the sentence
for token in sentence:
    print(porter.stem(token), end=" ")

lemmat is more sophist than stem 

In [16]:
# Lemmatization
nltk.download('omw-1.4')
# nltk.download("wordnet")  # uncomment on the first run
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()
(
    lemmatizer.lemmatize("walking"),
    lemmatizer.lemmatize("walking", pos=wordnet.VERB),
    lemmatizer.lemmatize("going"),
    lemmatizer.lemmatize("going", pos=wordnet.VERB),
    lemmatizer.lemmatize("mice"),
    lemmatizer.lemmatize("was"),
    lemmatizer.lemmatize("was", pos=wordnet.VERB),
    lemmatizer.lemmatize("better", pos=wordnet.ADJ),
)


[nltk_data] Downloading package omw-1.4 to C:\Users\ZAKY-
[nltk_data]     PC\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


('walking', 'walk', 'going', 'go', 'mouse', 'wa', 'be', 'good')

In [26]:
# automate the pos tagging
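def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag (as returned by nltk.pos_tag) to a WordNet POS constant; unknown tags default to noun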
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [27]:
# download tagger
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ZAKY-PC\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [28]:
sentence = "Donald Trump has a devoted following".split()
words_and_tags = nltk.pos_tag(sentence)
print(words_and_tags)
print('----------------')
for word, tag in words_and_tags:
    lemma = lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))
    print(lemma, end=" ")

[('Donald', 'NNP'),
 ('Trump', 'NNP'),
 ('has', 'VBZ'),
 ('a', 'DT'),
 ('devoted', 'VBN'),
 ('following', 'NN')]

In [34]:
sentence_2 = "the cat was following the bird as it flew by".split()
words_and_tags = nltk.pos_tag(sentence_2)
print(words_and_tags)
print('----------------')
for word, tag in words_and_tags:
    lemma = lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))
    print(lemma, end=" ")

[('the', 'DT'), ('cat', 'NN'), ('was', 'VBD'), ('following', 'VBG'), ('the', 'DT'), ('bird', 'NN'), ('as', 'IN'), ('it', 'PRP'), ('flew', 'VBD'), ('by', 'IN')]
----------------
the cat be follow the bird a it fly by 

## 15. Vector Similarity
- Finding similarity between words
- Application: article spinning, i.e. rewriting an existing article by replacing some of its words with similar words.

We assume vectors in ML live in Euclidean space, so each component is the length of the vector along the respective direction. Therefore, we can use the **distance** between vectors to define their **similarity** (the larger the distance, the less similar).

- Finding vector similarity:
    - **Euclidean distance**:
    $$\|x-y\|_2 = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_D - y_D)^2}$$
        - Take 2 vectors *x* and *y*, each of size D: find the difference of each component, square the differences, add them together, and take the square root of the result.

    - Angle between 2 vectors (**cosine of the angle**)
        - Cosine similarity is the cosine of the angle between the two vectors:
        $$
        sim(x, y) = \cos\theta = \frac{x^T y}{\|x\|_2 \, \|y\|_2}
        $$
        - The relationship between cosine distance and cosine similarity is:
          **Cosine distance = 1 - Cosine similarity**
        - E.g. for two parallel vectors (sim = 1):
        $$
        dist = 1 - sim \\
        dist = 1 - 1 \\
        dist = 0
        $$
        - Therefore, when two vectors are maximally similar (similarity 1, e.g. parallel), the distance is 0. If the two vectors are anti-parallel, the cosine similarity becomes -1, so the cosine distance is 2 (the maximum possible distance).
        - Cosine distance is not a true distance metric, as it does not satisfy the triangle inequality.
        
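    - Which one to use?
        - Cosine distance ignores the magnitude (length) of the vectors, so when vectors have different magnitudes and that difference matters, cosine distance might not be useful/accurate.
    - When are cosine distance and Euclidean distance equivalent?
        - When you normalize your vectors to unit length, the squared Euclidean distance equals 2 times the cosine distance. If the only task is to rank items, it then does not matter which similarity or distance you use.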
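
The distance definitions above can be checked numerically with a short sketch. The function names and the sample vectors are assumptions for this example; in practice a library such as NumPy or SciPy would normally be used instead of plain Python lists.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def cosine_distance(x, y):
    return 1.0 - cosine_similarity(x, y)

def normalize(x):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]

x, y = [1.0, 2.0], [2.0, 4.0]  # parallel, but different magnitudes
z = [-1.0, -2.0]               # anti-parallel to x

d_parallel = cosine_distance(x, y)  # close to 0
d_anti = cosine_distance(x, z)      # close to 2
d_euclid = euclidean_distance(x, y) # not 0: magnitude matters here

# For unit-length vectors: squared Euclidean distance = 2 * cosine distance
xn, yn = normalize([3.0, 1.0]), normalize([1.0, 2.0])
lhs = euclidean_distance(xn, yn) ** 2
rhs = 2.0 * cosine_distance(xn, yn)

print(d_parallel, d_anti, d_euclid)
print(lhs, rhs)
```

The parallel pair shows the difference between the two measures: the cosine distance is (numerically) zero, while the Euclidean distance is not, because only the latter is sensitive to vector length.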

##  18. Term Frequency – Inverse Document Frequency (TF-IDF)
 - A weighting method that builds on the count vectorizer
 - Commonly used for document retrieval and text mining
 - What's wrong with the count vectorizer?
     - Stopwords get high counts but are not useful for NLP tasks, since they appear in nearly every document/article
     - How do we know our list of stopwords is correct? -> We do not, since stopwords can be application-specific
 - **TF-IDF**:
 $$
     \text{TF-IDF} = \frac{\text{Term Frequency}}{\text{Document Frequency}}
 $$
     - Term frequency: the number of times the term appears in a document
     - Document frequency: the number of documents the term appears in
     - Document frequency increases as the word appears in more documents, so words that are common everywhere get a low weight
     - Common TF-IDF variation:
     $$
     tfidf(t,d) = tf(t,d) \times idf(t)
     $$
         - where t is a term and d is a document
     - IDF:
     $$
     idf(t) = \log \frac{N}{N(t)}
     $$
         - where N is the total number of documents and N(t) is the number of documents that term t appears in
         - The log compresses the range of the ratio (e.g. with 1,000,000 documents the raw ratio can be huge; the log shrinks it without changing the ordering of the terms)
     - TF-IDF in python
     ```
     from sklearn.feature_extraction.text import TfidfVectorizer
     tfidf = TfidfVectorizer()
     x_train = tfidf.fit_transform(train_texts)
     x_test = tfidf.transform(test_texts)
     
     # note: arguments exist for stopwords, tokenizer, strip accents, etc
     ```
     
     - **TF variations**:
         - Binary (1 if the word appears, 0 otherwise)
         - Normalize the count
         $$
         tf(t,d) = \frac{count(t,d)}{\sum\limits_{t' \in terms(d)} count(t',d)}
         $$
         - Take the log
         $$ tf(t,d) = \log(1+count(t,d)) $$

     - **IDF variations**:
         - Smooth IDF: the extra 1 in the denominator avoids division by zero for unseen terms, and the added 1 keeps the IDF of very common terms from dropping to 0
         $$ idf(t) = \log \frac{N}{N(t)+1} + 1 $$
         - IDF max: uses the maximum document frequency among the terms of the current document, so the ratio inside the log is relative to the current document instead of the whole dataset
         $$ idf(t) = \log \frac{\max\limits_{t' \in terms(d)} N(t')}{N(t)} $$
         - Probabilistic IDF
         $$ idf(t) = \log \frac{N - N(t)}{N(t)} $$
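
To see the basic formulas at work, here is a hand-rolled sketch using the raw-count TF and the plain `log(N / N(t))` IDF from above. The toy corpus and the helper names are assumptions, and the numbers will differ from `TfidfVectorizer`, which by default uses a smoothed IDF and normalizes each document vector.

```python
import math
from collections import Counter

# Toy corpus, already tokenized
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
N = len(docs)

# N(t): number of documents that contain term t (each document counted once)
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    return math.log(N / doc_freq[term])

def tfidf(term, doc):
    tf = doc.count(term)  # raw count variant of tf(t, d)
    return tf * idf(term)

score_the = tfidf("the", docs[0])  # appears in every document
score_cat = tfidf("cat", docs[0])  # appears in 2 of 3 documents
score_mat = tfidf("mat", docs[0])  # appears in 1 of 3 documents

print(score_the, score_cat, score_mat)
```

Although "the" has the highest raw count in the first document, its score is 0 because it occurs in every document, so it is down-weighted without needing a stopword list.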

## 19. Word-to-Index Mapping
- When converting documents to vectors, the result is a **document-term matrix** 
- Row = document, column = term (size = #documents x #terms)
- Which column corresponds to which word?
  ```
  from nltk.tokenize import word_tokenize

  current_idx = 0  # next free index, starting at 0
  word2idx = {}  # maps each word to its column index
  for doc in documents:  # documents: a list of strings
      tokens = word_tokenize(doc)  # tokenize the document
      for token in tokens:
          if token not in word2idx:  # first time we see this token
              word2idx[token] = current_idx  # key is the token, value is its index
              current_idx += 1  # increment the index
  ```
- Word-to-index mapping -> Count Vectorizer -> TF-IDF
- What to do with words that are in the test set but not in the train set?
    1. Ignore those words
    2. Create a special index for unknown/rare words
- A reverse mapping (index-to-word) is necessary so that we know which word a feature refers to when the output is an index.
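
The second option (a special index for unknown words) and the reverse mapping can be sketched as follows. The `<unk>` token name, the sample documents, and the `encode` helper are assumptions for this example; whitespace splitting stands in for a real tokenizer.

```python
UNK = "<unk>"  # hypothetical placeholder token for unseen words

train_docs = ["i like cats", "i like dogs"]
test_doc = "i like birds"

# Build the mapping from the training documents only; index 0 is reserved for UNK
word2idx = {UNK: 0}
for doc in train_docs:
    for token in doc.split():
        if token not in word2idx:
            word2idx[token] = len(word2idx)

# Reverse mapping: index -> word
idx2word = {idx: word for word, idx in word2idx.items()}

def encode(doc):
    """Map each token to its index, falling back to the UNK index."""
    return [word2idx.get(token, word2idx[UNK]) for token in doc.split()]

encoded = encode(test_doc)
decoded = [idx2word[idx] for idx in encoded]

print(encoded)
print(decoded)
```

Using `len(word2idx)` as the next index removes the need for a separate counter variable, and the reverse dictionary lets us read model outputs back as words.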


## 21. Neural Word Embeddings
- A method to create vectors out of text
- Word embeddings are essentially word vectors
- Typically used to convert words, rather than whole documents, into vectors; a document thus becomes a sequence of vectors
- Carries more information than bag of words (the word order is preserved, unlike bag of words)
- 2 common methods: word2vec and GloVe
    - **Word2vec** (feed-forward neural network)
        - Embeddings are stored in the weights of the NN
        - Training finds the weights of the NN, which effectively are the word embeddings themselves
        - Given an input word, it predicts whether an output word appears in its context
        - E.g. given the sentence "The quick brown fox jumps over the lazy dogs" and the input "jumps", the correct outputs (as binary labels) are the words that are close to "jumps"
    - **GloVe**
        - Does not primarily use an ANN, although it uses one at later stages
        - Similar to a recommender system
        - A recommender system looks for patterns in which users liked which movies (e.g. if Bob and Alice rate movies similarly, they can be used to estimate each other's unknown ratings)
        - E.g. given the sentence "The quick brown fox jumps over the lazy dogs" and the input "jumps", it scores "fox" as 1/1 (closest) and "brown" as 1/2 (further away); the further away a word is, the lower the score it gets
        - Thus the rating is based on the distance between words
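
The distance-based scoring in the GloVe example can be sketched as a weighted context count. This only illustrates the 1/distance weighting step, not GloVe training itself; the function name `context_weights` and the window size of 2 are assumptions.

```python
from collections import defaultdict

def context_weights(tokens, target, window):
    """Sum 1/distance for every word within `window` positions of `target`."""
    weights = defaultdict(float)
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                weights[tokens[j]] += 1.0 / abs(i - j)
    return dict(weights)

tokens = "The quick brown fox jumps over the lazy dogs".lower().split()
weights = context_weights(tokens, "jumps", window=2)

print(weights)
```

Words right next to "jumps" ("fox", "over") get a weight of 1, words two positions away ("brown", "the") get 1/2, and words outside the window get no entry. The words that receive an entry are also the ones a word2vec-style model would treat as positive context examples for "jumps".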