# Natural Language Processing (NLP) Intro
##### Enabling computers to understand and manipulate human language. 🤖💬

<div style="display: flex; flex-direction: row;">
    <div style="flex: 2; padding:10px; margin:10px;">

**Text feature extraction** converts raw text data into numerical representations that ML algorithms can process.

A common approach to text feature extraction is to represent each piece of text as a vector, with **each dimension of the vector corresponding to a specific feature of the text**, such as the frequency of certain words or the presence of certain phrases.

**Tokenization** involves breaking down a text into individual units known as tokens, such as words, phrases, or other meaningful units.
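As a rough illustration of tokenization, the sketch below lowercases a sentence and pulls out word-like tokens with a regular expression. The `tokenize` helper and its pattern are assumptions made for this example; real tokenizers handle punctuation, casing, and sub-word units in many different ways.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("My dog is grumpy, but my cat isn't!")
print(tokens)
# ['my', 'dog', 'is', 'grumpy', 'but', 'my', 'cat', "isn't"]
```

The code cells later in this notebook use the simpler `text.lower().split()`, which is enough when the text contains no punctuation.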

## TF, IDF & TF-IDF

Term frequency (TF) measures how often each term/token occurs in a given document, either as a raw count or as a count normalized by the document's length.

- By using TF, we can create a vocabulary of all the terms in our corpus and represent each document as a vector of term frequencies.

Simply using term frequency can result in common words (such as "the" and "and") being weighted too heavily, so we can use a more sophisticated approach called <a href="https://en.wikipedia.org/wiki/Tf–idf">Term-Frequency Inverse Document Frequency (TF-IDF)</a> instead.

Inverse document frequency (IDF) measures how rare a term is across the entire corpus of documents: the fewer documents contain the term, the higher its IDF. The TF-IDF score for each term is the product of its TF and IDF values.

By using TF-IDF, we can better represent the importance of each term in a given document and compare the similarity of different documents.
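Several variants of these quantities exist. One common convention, which matches the worked example later in this notebook (raw counts and a natural logarithm), can be written as follows, where $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing term $t$:

$$
\mathrm{tf}(t, d) = \text{number of times } t \text{ occurs in } d, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

For instance, a term that appears in 2 of 3 documents has $\mathrm{idf} = \ln(3/2) \approx 0.405$, while a term that appears in every document has $\mathrm{idf} = \ln(1) = 0$ and therefore contributes nothing to the score.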

</div>
    <div style="flex: 1.5; padding:0px; margin:20px;">

### **Who is this BERT?**
In 2017, the research paper <a href="https://arxiv.org/abs/1706.03762">Attention is all you need</a> introduced the Transformer architecture. Building on it, Google researchers introduced **Bidirectional Encoder Representations from Transformers** (BERT) in 2018, which achieved strong results on a variety of NLP tasks.

- In 2019, Google introduced ALBERT, a lighter variant of BERT that achieved state-of-the-art results on a variety of natural language understanding tasks while using significantly fewer parameters than BERT. <em>Article: <a href="https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html">GoogleBlog.</a></em>

- In 2021, a BERT-based model was used in a study to identify and classify fake news on social media. The authors reported 98.90% accuracy, which suggests potential applications in media and journalism. <em>Article: <a href="https://link.springer.com/article/10.1007/s11042-020-10183-2">FakeBERT</a></em>

**BERT vs GPT**

BERT is a pre-trained language model that is designed to learn contextual representations of words by jointly conditioning on both left and right context in all layers. It is trained using a masked language modeling objective, where some input tokens are randomly masked and the model is trained to predict the original tokens.

GPT is a generative language model that is trained to predict the next token in a sequence given the previous tokens. It uses a left-to-right training objective, where the model is conditioned only on the left context when predicting the next token.
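One way to picture the difference is through which positions each token is allowed to "see". The sketch below builds two toy visibility masks: a full one for the bidirectional case and a lower-triangular one for the left-to-right case. The function names are hypothetical and this is a simplified illustration, not how either model is actually implemented.

```python
def bidirectional_mask(n):
    """BERT-style: every position may attend to every other position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """GPT-style: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

tokens = ["my", "dog", "is", "lazy"]
masks = {
    "bidirectional": bidirectional_mask(len(tokens)),
    "causal": causal_mask(len(tokens)),
}
for name, mask in masks.items():
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>5}: {row}")
```

In the causal mask, the row for "dog" is `[1, 1, 0, 0]`: it can use "my" and itself but nothing to its right, whereas every row of the bidirectional mask is all ones.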

</div></div>

### TF-IDF Example

Suppose we have a corpus of three documents about pets:

- Document 1: "My dog is grumpy"
- Document 2: "My cat is lazy"
- Document 3: "The dog is lazy"

We want to use TF-IDF to compare the similarity of these documents. First, we create a vocabulary of all the unique terms in the corpus.

In [81]:
import numpy as np
import pandas as pd

corpus = ["My dog is grumpy", "My cat is lazy", "The dog is lazy"]

# Tokenize: lowercase each document and split on whitespace
documents = [doc.lower().split() for doc in corpus]

# Collect the unique terms across all documents (set order is arbitrary)
terms = list({term for doc in documents for term in doc})

terms

['my', 'is', 'the', 'lazy', 'grumpy', 'dog', 'cat']

In [82]:
# Term frequency (TF) for each term in each document
tf = {}
for i, doc in enumerate(documents):
    tf[i] = {}
    for term in terms:
        tf[i][term] = doc.count(term)

tf_df = pd.DataFrame(tf).T
tf_df

| | my | is | the | lazy | grumpy | dog | cat |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |


In [83]:
# Inverse document frequency (IDF) for each term:
# the natural logarithm of (total number of documents / number of documents containing the term)

idf = {}
num_docs = len(documents)
for term in terms:
    num_docs_with_term = sum(term in doc for doc in documents)
    idf[term] = np.log(num_docs / num_docs_with_term)

idf_df = pd.DataFrame(idf, index=['idf'])
idf_df

| | my | is | the | lazy | grumpy | dog | cat |
|---|---|---|---|---|---|---|---|
| idf | 0.405465 | 0.0 | 1.098612 | 0.405465 | 1.098612 | 0.405465 | 1.098612 |


In [84]:
# Calculate TF-IDF score for each term in each document (product of its TF and IDF scores)
tf_idf = {}
for i, doc in enumerate(documents):
    tf_idf[i] = {}
    for term in terms:
        tf_idf[i][term] = tf[i][term] * idf[term]

# Convert TF-IDF dictionary to DataFrame
tf_idf_df = pd.DataFrame(tf_idf).T
tf_idf_df

| | my | is | the | lazy | grumpy | dog | cat |
|---|---|---|---|---|---|---|---|
| 0 | 0.405465 | 0.0 | 0.0 | 0.0 | 1.098612 | 0.405465 | 0.0 |
| 1 | 0.405465 | 0.0 | 0.0 | 0.405465 | 0.0 | 0.0 | 1.098612 |
| 2 | 0.0 | 0.0 | 1.098612 | 0.405465 | 0.0 | 0.405465 | 0.0 |


If we have a text classification task where we want to classify documents into different categories, we could represent each document as a vector of TF-IDF scores for each term in the vocabulary. These vectors can then be used as input features to train a machine learning model such as a Naive Bayes classifier or a Support Vector Machine.
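The example set out to compare the similarity of the three documents, and one common way to do that with TF-IDF vectors is cosine similarity. The sketch below recomputes the vectors with the standard library only, using the same convention as the cells above (raw counts and `ln(N / df)`). The `cosine_similarity` helper is written for this illustration and is not part of the original notebook.

```python
import math

corpus = ["My dog is grumpy", "My cat is lazy", "The dog is lazy"]
documents = [doc.lower().split() for doc in corpus]
terms = sorted({term for doc in documents for term in doc})

# Same convention as above: raw term counts and idf = ln(N / df)
idf = {
    term: math.log(len(documents) / sum(term in doc for doc in documents))
    for term in terms
}
vectors = [[doc.count(term) * idf[term] for term in terms] for doc in documents]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

for i in range(len(vectors)):
    for j in range(i + 1, len(vectors)):
        score = cosine_similarity(vectors[i], vectors[j])
        print(f"Document {i + 1} vs Document {j + 1}: {score:.3f}")
```

All three pairs score about 0.107 here: each pair shares exactly one informative term ("my", "dog", or "lazy"), and the shared "is" contributes nothing because its IDF is 0.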

## Bag of words

A bag-of-words model represents each document by how many times each vocabulary term occurs in it, ignoring word order.

In [46]:
review1 = "I LOVE this book about love"
review2 = "No this book was okay"

terms = [text.lower().split() for text in [review1, review2]]
terms

[['i', 'love', 'this', 'book', 'about', 'love'],
 ['no', 'this', 'book', 'was', 'okay']]

In [47]:
# Flatten
terms = [word for text in terms for word in text]

In [48]:
terms = set(terms)
terms

{'about', 'book', 'i', 'love', 'no', 'okay', 'this', 'was'}

In [51]:
vocabulary = {word: idx for idx, word in enumerate(terms)}
vocabulary

{'about': 0,
 'okay': 1,
 'i': 2,
 'book': 3,
 'no': 4,
 'love': 5,
 'this': 6,
 'was': 7}

In [52]:
# vocabulary: mapping from each term to its column index
# doc: the raw text of one review

def term_frequency_vectorizer(doc, vocabulary):
    term_frequency = np.zeros(len(vocabulary))

    for word in doc.lower().split():
        # Skip words that are not in the vocabulary instead of raising a KeyError
        index = vocabulary.get(word)
        if index is not None:
            term_frequency[index] += 1

    return term_frequency

In [55]:
review1_tf = term_frequency_vectorizer(review1, vocabulary)
review2_tf = term_frequency_vectorizer(review2, vocabulary)

review1_tf, review2_tf

(array([1., 0., 1., 1., 0., 2., 1., 0.]),
 array([0., 1., 0., 1., 1., 0., 1., 1.]))

In [57]:
review1, review2, vocabulary

('I LOVE this book about love',
 'No this book was okay',
 {'about': 0,
  'okay': 1,
  'i': 2,
  'book': 3,
  'no': 4,
  'love': 5,
  'this': 6,
  'was': 7})

In [61]:
bag_of_words = pd.DataFrame([review1_tf, review2_tf], columns=vocabulary.keys())
bag_of_words

| | about | okay | i | book | no | love | this | was |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 2.0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |


## Bag of words with sklearn

In [64]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()

bag_of_words_sparse_matrix = count_vectorizer.fit_transform([review1, review2])

In [65]:
bag_of_words_sparse_matrix.todense()

matrix([[1, 1, 2, 0, 0, 1, 0],
        [0, 1, 0, 1, 1, 1, 1]])

In [66]:
count_vectorizer.get_feature_names_out()

array(['about', 'book', 'love', 'no', 'okay', 'this', 'was'], dtype=object)

In [69]:
sk_bag_of_words = pd.DataFrame(bag_of_words_sparse_matrix.todense(), columns=count_vectorizer.get_feature_names_out())
sk_bag_of_words

| | about | book | love | no | okay | this | was |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
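
The sklearn vocabulary has seven terms, while the hand-built one has eight: the token "i" is missing. This follows from the default `token_pattern` of `CountVectorizer`, which only keeps tokens of two or more word characters. The sketch below applies that same pattern with the standard `re` module to show the effect; the constant name is chosen for this example.

```python
import re

# Default token_pattern of CountVectorizer: tokens of two or more word characters
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

review1 = "I LOVE this book about love"
tokens = re.findall(DEFAULT_TOKEN_PATTERN, review1.lower())
print(tokens)
# ['love', 'this', 'book', 'about', 'love']
```

If single-character tokens matter for a task, a pattern such as `token_pattern=r"(?u)\b\w+\b"` can be passed to the vectorizer to keep them.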


## TF-IDF with sklearn

In [80]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit_transform([review1, review2]).todense()

matrix([[0.4078241 , 0.29017021, 0.81564821, 0.        , 0.        ,
         0.29017021, 0.        ],
        [0.        , 0.35520009, 0.        , 0.49922133, 0.49922133,
         0.35520009, 0.49922133]])

A TF-IDF score combines a term's frequency in a document with its rarity in the corpus. A higher score means the term appears frequently in the document and is relatively rare in the corpus, which makes it more useful for distinguishing that document from others.
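
The values returned by `TfidfVectorizer` differ from the plain `tf * ln(N / df)` formula used in the manual example because, with its default settings, the vectorizer smooths the IDF as `ln((1 + N) / (1 + df)) + 1` and then scales each document vector to unit L2 length. The sketch below reproduces that calculation with the standard library only. The helper names are made up for this example, and the whitespace split with a length filter only approximates the vectorizer's tokenization, which is sufficient for these two reviews.

```python
import math

reviews = ["I LOVE this book about love", "No this book was okay"]

# Approximate the default tokenization: lowercase, keep tokens of 2+ characters
documents = [[w for w in text.lower().split() if len(w) > 1] for text in reviews]
terms = sorted({w for doc in documents for w in doc})
n_docs = len(documents)

def smoothed_idf(term):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    df = sum(term in doc for doc in documents)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_vector(doc):
    """Raw count times smoothed IDF, scaled to unit L2 norm."""
    raw = [doc.count(term) * smoothed_idf(term) for term in terms]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

print(terms)
for doc in documents:
    print([round(x, 4) for x in tfidf_vector(doc)])
```

With these assumptions, the rows agree with the matrix printed above to the displayed precision, for example about 0.4078 for "about" and 0.8156 for "love" in the first review.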