# Embeddings 

After text has been preprocessed, the next step involves mapping this concise version of the text to numbers, namely vectors. These vector representations of words or phrases are called **embeddings**. There are endless ways to generate these vectors, but we’ll only highlight frequently used techniques.

## Counting Based Embedding Techniques 

For the most part, methods for generating embeddings fall into two groups: counting-based approaches and more complex, neural-network-based approaches. Many counting-based methods have been replaced by their neural network counterparts, but two remain fairly popular: Term Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words (BOW). TF-IDF assigns a score to each word in a document based on how often the word occurs in that document and how many documents in the corpus contain it. By taking the product of the term frequency and the inverse document frequency, TF-IDF measures how distinctive a word is for a document.

It is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.
This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

### How to calculate the score

![tfidf](images/tfidf.png)

#### Term frequency
The term frequency measures how often a word appears in a document. There are several ways of calculating it, the simplest being a raw count of the times the word appears in the document. The count can also be adjusted, for example by dividing it by the length of the document or by the raw count of the most frequent word in the document.
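The three variants mentioned above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `term_frequencies` and the sample sentence are hypothetical, and the document is assumed to be non-empty and already tokenized.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return three term-frequency variants for one tokenized document."""
    counts = Counter(tokens)              # raw count of each word
    total = len(tokens)                   # document length in tokens
    max_count = max(counts.values())      # count of the most frequent word
    raw = dict(counts)
    by_length = {w: c / total for w, c in counts.items()}
    by_max = {w: c / max_count for w, c in counts.items()}
    return raw, by_length, by_max

# Hypothetical, already-tokenized document
tokens = "the cat sat on the mat".split()
raw, by_length, by_max = term_frequencies(tokens)

print(raw["the"])        # raw count
print(by_length["the"])  # count divided by document length
print(by_max["cat"])     # count divided by the count of the most frequent word
```

Length normalization keeps long documents from dominating simply because they contain more words.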

#### Inverse document frequency

- The inverse document frequency measures how common or rare a word is across the entire document set. It is calculated by dividing the total number of documents by the number of documents that contain the word, and taking the logarithm of the result. The closer the value is to 0, the more common the word is.

- Words that are common in every document, such as *this*, *what*, and *if*, therefore rank low even though they may appear many times, since they don’t mean much to any one document in particular.

- If a word appears in nearly every document, the ratio approaches 1 and its logarithm approaches 0. The rarer the word, the larger the value becomes.

![tfidf](tfidf.png)
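To make the calculation concrete, here is a minimal from-scratch sketch of the unsmoothed definition described above. The toy corpus and the helper names `idf` and `tf_idf` are hypothetical. The sketch assumes raw counts for the term frequency, the natural logarithm, and that the word occurs in at least one document.

```python
import math

# Hypothetical toy corpus: each document is a list of lowercase tokens
docs = [
    "this is a story about basketball".split(),
    "this is a recipe for soup".split(),
    "this is a story about soup".split(),
]

def idf(word, docs):
    """log(total documents / documents containing the word)."""
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

def tf_idf(word, doc, docs):
    """Raw-count term frequency multiplied by inverse document frequency."""
    return doc.count(word) * idf(word, docs)

print(idf("this", docs))                    # word found in every document
print(idf("basketball", docs))              # word found in one document
print(tf_idf("basketball", docs[0], docs))  # score within the first document
```

A word that appears in every document gets an IDF of 0 and therefore a TF-IDF score of 0, no matter how often it occurs.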

### Applications
#### Information retrieval
TF-IDF was invented for document search and can be used to deliver the results that are most relevant to a query. Imagine you have a search engine and somebody searches for LeBron. The results are displayed in order of relevance: articles in which LeBron appears often relative to the rest of the collection receive a higher TF-IDF score for that word, so the most relevant sports articles are ranked higher.
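A rough sketch of such a ranking, assuming scikit-learn is available. The three articles are made up for illustration. Cosine similarity is one common way to compare a query with documents, and is an assumption here rather than something this section prescribes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-collection of articles
articles = [
    "LeBron scored forty points as the Lakers won the game",
    "The city council approved the new budget for road repairs",
    "A simple recipe for tomato soup with fresh basil",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(articles)   # one row per article

# Project the query into the vocabulary learned from the articles
query_vector = vectorizer.transform(["LeBron"])

# Cosine similarity between the query and every article
scores = cosine_similarity(query_vector, doc_vectors).ravel()

# Article indices, most relevant first
ranking = scores.argsort()[::-1]
best = int(ranking[0])
print(articles[best])
```

Articles that do not contain the query term get a similarity of 0 and fall to the bottom of the ranking.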
#### Keyword extraction
TF-IDF is also useful for extracting keywords from text. How? The highest-scoring words of a document are the most relevant to that document, so they can be treated as its keywords. Pretty straightforward.

To generate the *TF-IDF* vector, we'll rely on the `TfidfVectorizer` class from the `sklearn` library.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
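# Note: there is no comma between the two string literals, so Python joins
# them into a single string and the corpus contains exactly one document
corpus = ["Hey! I'm new in town. "
          "Can you please point me in the direction of the groccery store"]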

In [3]:
# Instantiate an instance of the TfidfVectorizer
vectorizer = TfidfVectorizer(use_idf=True)

# Fit the vectorizer to the corpus
fitted_vectorizer = vectorizer.fit(corpus)

# Transform the corpus using the fitted vectorizer
X = fitted_vectorizer.transform(corpus)

In [6]:
# Retrieve the feature names 
fitted_vectorizer.get_feature_names_out()

array(['can', 'direction', 'groccery', 'hey', 'in', 'me', 'new', 'of',
       'please', 'point', 'store', 'the', 'town', 'you'], dtype=object)

Note that in the code above, the `fit` and `transform` calls for the vectorizer are broken down into two separate steps. Alternatively, the `fit_transform` method can combine these two steps into one. 
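A short sketch of the equivalence, using a hypothetical two-document corpus and NumPy only to compare the results:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus
docs = ["new in town", "directions to the store"]

# Two steps: learn the vocabulary and IDF weights, then transform
two_step = TfidfVectorizer().fit(docs).transform(docs)

# One step: fit_transform does both at once
one_step = TfidfVectorizer().fit_transform(docs)

# Both calls produce the same matrix
same = bool(np.allclose(two_step.toarray(), one_step.toarray()))
print(same)
```

Keeping the steps separate is still useful when the fitted vectorizer must later transform new text with the same vocabulary.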

In [7]:
# Convert the sparse matrix into a Pandas DataFrame for later modeling

import pandas as pd

df = pd.DataFrame(X[0].T.toarray(),
                  index=fitted_vectorizer.get_feature_names_out(),
                  columns=["TF-IDF"])
print(df.sort_values("TF-IDF", ascending=False))

             TF-IDF
in         0.447214
the        0.447214
can        0.223607
direction  0.223607
groccery   0.223607
hey        0.223607
me         0.223607
new        0.223607
of         0.223607
please     0.223607
point      0.223607
store      0.223607
town       0.223607
you        0.223607
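These scores can be checked by hand. The two string literals in `corpus` are adjacent with no comma between them, so Python concatenates them and the corpus holds a single document. The sketch below assumes scikit-learn's default settings: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` and L2 normalization. Under those assumptions every word in a one-document corpus has an IDF of 1, so each score is just the word count divided by the Euclidean norm of all counts.

```python
import math

# Word counts from the single-document corpus above:
# "in" and "the" appear twice, the other twelve words once
counts = {"in": 2, "the": 2}
for word in ["can", "direction", "groccery", "hey", "me", "new",
             "of", "please", "point", "store", "town", "you"]:
    counts[word] = 1

n_docs = 1   # documents in the corpus
df_word = 1  # every word occurs in that one document

# Smoothed IDF, assumed to match scikit-learn's default (smooth_idf=True)
idf = math.log((1 + n_docs) / (1 + df_word)) + 1

weights = {w: c * idf for w, c in counts.items()}
norm = math.sqrt(sum(v * v for v in weights.values()))  # L2 norm
tfidf = {w: v / norm for w, v in weights.items()}

print(round(tfidf["in"], 6))   # 0.447214
print(round(tfidf["hey"], 6))  # 0.223607
```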




## Bag of Words

Alternatively, BOW describes the occurrence of words within a document. It counts how often each word appears, ignoring grammar and word order, and creates vectors in which a word's importance is reflected by its frequency in the document.


![bow](images/bow.png)
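The idea can be sketched without any library. The helper `bag_of_words` is hypothetical. Its tokenization (lowercasing, keeping words of two or more characters) is an assumption chosen to resemble the default behavior of scikit-learn's `CountVectorizer`.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase word tokens of two or more characters."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

# Word order does not matter: both sentences give the same bag
a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

print(a["the"])   # 2
print(a == b)     # True
```

Because only counts are kept, two sentences with opposite meanings can end up with identical vectors.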

Similar to *TF-IDF*, we'll leverage `sklearn`'s `CountVectorizer` class.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate Count Vectorizer 
vectorizer = CountVectorizer()

# Fit to the corpus 
fitted_vectorizer = vectorizer.fit(corpus)

# Transform using fitted vectorizer 
X = fitted_vectorizer.transform(corpus)

In [14]:
# Convert the sparse matrix into a Pandas DataFrame for later modeling

import pandas as pd

df = pd.DataFrame(X[0].T.toarray(),
                  index=fitted_vectorizer.get_feature_names_out(),
                  columns=["Bag of Words"])
print(df.sort_values("Bag of Words", ascending=False))

           Bag of Words
in                    2
the                   2
can                   1
direction             1
groccery              1
hey                   1
me                    1
new                   1
of                    1
please                1
point                 1
store                 1
town                  1
you                   1


